Initial commit: obsidian to gitea
projects/agentic-kvcache/main.md (new, empty)

projects/agentic-kvcache/ongoing.md (new):
```
• I have finished the runs and organized the results. The conclusion is fairly clear: the current P-side priority-eviction implementation brings no measurable benefit. Your expectation is mechanically correct, but under the current workload and trigger conditions there is no effective surface for it to pay off.

Micro 2P2D results

Same trace: 316 requests / 58 sessions / 2P2D / Mooncake / no-stream.

| Group | ok/error | mean | p50 | p90 | p99 | direct-to-D | actual KV blocks | Notes |
|---|---:|---:|---:|---:|---:|---:|---:|---|
| PD baseline | 316/0 | 22.879s | 25.565s | 29.026s | 30.642s | 0 | 86019 | stable |
| release, threshold=3 | 316/0 | 22.937s | 25.617s | 29.605s | 32.457s | 0 | 86019 | stable, but no direct-to-D benefit |
| priority, threshold=3 | 316/0 | 22.897s | 25.801s | 29.393s | 32.622s | 0 | 86019 | only 6 low-priority P requests |
| release, threshold=4 | 314/2 | 21.605s | 25.078s | 28.498s | 30.143s | 12 | 82892 | saves KV, but not stable |
| priority, threshold=4 | 314/2 | 21.743s | 25.286s | 29.094s | 30.561s | 12 | 82892 | priority gives no improvement, slightly slower |

priority-group output paths:
outputs/live-priority-eviction/...
outputs/live-priority-eviction-thr4/...

Ali filtered results

Generated filtered trace:
outputs/ali-filtered-smallappend-priority-test.jsonl

Scale: 81 requests / 28 sessions / 53 turn2+ / max input 18901 / max output 1925 / span 5414s.

PD baseline completed:

- ok/error: 81/0
- mean: 9.072s
- p50: 7.086s
- p90: 21.761s
- p99: 26.813s
- actual KV blocks: 460

But kvcache-centric failed to converge twice on Ali filtered:

- threshold=4: stuck after about 58/81 /generate 200 OK responses, GPU at 0%, no metrics
- threshold=3: stuck after about 62/81 /generate 200 OK responses, GPU at 0%, no metrics

I aborted both runs rather than waiting out the 1200s/3600s timeouts. This says that on Ali filtered, the current kvcache-centric direct/session mechanism itself is not yet stable, so the priority-eviction ablation cannot stand.

Does this match expectations?

Partially, but it did not translate into a performance gain.

Your hypothesis was: after a P->D transfer, P can preferentially release/evict that session's prefix cache and thereby improve P-side prefix reuse. The hypothesis holds at the mechanism level, but the effect is invisible in these experiments, because:

1. release-after-transfer already removes the explicit duplication between the P streaming-session backup and the D session copy.
2. The new priority eviction only changes the eviction order of the P radix/prefix cache, so it only matters when the P radix cache is under real pressure.
3. To stay stable, micro threshold=3 almost disables direct-to-D; only 6 turn1 seeds are low-priority, so the trigger surface is tiny.
4. Micro threshold=4 gets 12 direct-to-D requests and cuts KV blocks from 86019 to 82892, but also produces 2 errors, and priority itself neither reduces KV further nor improves cache hit.
5. SGLang radix priority propagates max(priority) along shared prefixes; if a low-priority direct session shares a prefix with a high-priority normal session, the high priority protects the shared nodes, and the low priority alone cannot get the shared prefix evicted (see the sketch after this block).

Next optimization steps

Priority should be upgraded from "just tagging P radix eviction" into a fuller admission/eviction mechanism:

1. First fix the Ali kvcache-centric convergence problem: give direct-to-D / seed / close_session a shorter per-request timeout plus a failure fallback, so replay cannot deadlock.
2. Record real P-side radix evictions / cached tokens / token usage, instead of only request-level cached_tokens.
3. Change the direct-to-D predictor from "is this a seed" to "is it worth letting P keep the prefix": only sessions expected to keep hitting P get high priority.
4. Make P radix-cache demotion session-aware; request priority alone is not enough, because shared radix nodes get polluted by high-priority prefixes.
5. Under multi P/D, the P low-priority policy must stay consistent with the D sticky/direct policy; otherwise P releases session A and a later request gets routed back to P for a large prefill, which makes things worse.
```
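Point 5 above is the subtle one. A minimal sketch of why max(priority) propagation on shared radix prefixes blocks low-priority eviction; the node structure and method names here are illustrative, not SGLang's actual radix-tree code:

```python
# Hypothetical radix-tree sketch: priorities propagate as max() along shared
# prefixes, so a low-priority session cannot force eviction of nodes it
# shares with a high-priority session.
from dataclasses import dataclass, field

@dataclass
class Node:
    token: tuple
    priority: int = 0            # effective priority = max over all users
    children: dict = field(default_factory=dict)

    def insert(self, tokens, priority):
        self.priority = max(self.priority, priority)  # max-propagation
        if not tokens:
            return
        head, rest = tokens[0], tokens[1:]
        child = self.children.setdefault(head, Node((head,)))
        child.insert(rest, priority)

    def evictable(self, cutoff):
        # Evictable only if *every* user is below the cutoff; one
        # high-priority session protects the whole shared prefix.
        return self.priority < cutoff

root = Node(())
root.insert((1, 2, 3, 7, 8), priority=0)   # low-priority direct session
root.insert((1, 2, 3, 9, 9), priority=5)   # high-priority normal session
shared = root.children[1]
print(shared.priority, shared.evictable(cutoff=3))  # -> 5 False: prefix protected
```

This is why item 4 above argues for session-aware demotion: the eviction decision has to look at which sessions still need a node, not just the max priority stamped on it.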
```
• I have run a more complete multi P/D sweep and analyzed session gaps, where the gains actually come from, and which optimizations are feasible. Conclusion first: we currently cannot show that kvcache-centric stably beats PD-disaggregation on the full workload; the real gains come from D-side session/KV reuse, not from P-node cache hits. In the larger experiments, scaling D clearly matters more than scaling P.

Experimental setup

Uses the more complete synthetic micro trace: outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl

- 316 requests, 58 sessions, all multi-turn
- turn counts: 2 to 10, averaging 5.45 turns/session
- initial input 30k tokens, later appends of 1k tokens, outputs of 256 tokens
- raw duration 1322s; replay uses --time-scale 50, so runtime session gaps shrink by about 50x
- no-stream, concurrency-limit 1000; valid for internal side-by-side comparison, but do not mix it directly with earlier streaming results

Session-gap comparison

Full micro:

| trace | requests | sessions | multi-turn sessions | inter-turn gap p50 | p90 | p99 |
|---|---:|---:|---:|---:|---:|---:|
| full micro raw | 316 | 58 | 58 | 100.5s | 123.0s | 129.4s |
| full micro replay-equivalent | 316 | 58 | 58 | 2.01s | 2.46s | 2.59s |
| Ali filtered small | 24 | 7 | 7 | 2.42s | 4.65s | 10.02s |
| Ali normalized 10min | 142 | 41 | 6 | 4.26s | 8.50s | 379.4s |

This shows that the new full micro's replayed gaps are already close to Ali filtered small, but it is more idealized than Ali normalized: all of its sessions are multi-turn, whereas Ali normalized has only 6/41 multi-turn sessions, and the many turn1 seeds pollute the KV/cache budget.

PD-disaggregation multi P/D results

All PD runs had 0 errors.

| topology | mean E2E | p50 | p90 | p99 | Conclusion |
|---|---:|---:|---:|---:|---|
| 1P1D | 44.125s | 50.473s | 56.743s | 58.872s | baseline |
| 2P1D | 42.801s | 49.016s | 54.925s | 57.446s | adding P only; tiny gain, about 3.0% |
| 1P2D | 29.641s | 29.261s | 45.614s | 53.433s | adding D only; clear gain, about 32.8% |
| 2P2D | 22.879s | 25.563s | 29.023s | 30.612s | best, about 48.2% |

This data is more trustworthy than the 12-request micro. Under this workload, D-node count matters more than P-node count, because E2E is dominated by decode concurrency, D-side queueing, and D-side session residency.

kvcache-centric results

The mean over successful requests sometimes looks lower, but there are many errors, so it cannot count as a valid win.

| admission | topology | errors | successful mean | direct-to-D | Main problem |
|---|---:|---:|---:|---:|---|
| worker | 1P1D | 0 | 45.333s | 0 | worse than PD 1P1D |
| worker | 2P1D | 1 | 43.471s | 0 | adding P only is still limited |
| worker | 1P2D | 33 | 25.109s | 0 | success-only mean is biased; too many errors |
| worker | 2P2D | 32 | 19.733s | 1 | unstable; cannot count as a win |
| router | 1P2D | 58 | 23.536s | 0 | even more kvcache-path errors |
| router | 2P2D | 38 | 20.185s | 0 | still unstable |

Key observation: the multi-D kvcache-centric success means are low because failed requests are excluded; at the same time direct-to-D = 0/1, which means we are not yet stably exercising the mechanism we actually want to validate: KV already resident on D, later turns hitting D directly. Most requests land in pd-router-fallback-large-append-session-cap, i.e. they are bounced back to the PD router by session-cap/admission.

Are more P-nodes better?

The current data does not support the conclusion that more P-nodes help.

- Under PD, going from 1P1D to 2P1D only improves 44.125s -> 42.801s.
- Under PD, going from 1P1D to 1P2D improves 44.125s -> 29.641s.
- kvcache-centric's target path is direct-to-D, which bypasses P on a hit, so P-node cache hits are not the main source of gains.
- In the current implementation P may keep prefill backup KV, so the earlier kvcache-centric gains mix D reuse, reduced KV transfer, P backup duplication, and fallback/reseed; they cannot be explained as pure P cache hits.

For xP + yD:

- Fixing yD and only adding P helps only when prefill is clearly congested; that was not the main bottleneck here.
- Fixing xP and adding D directly relieves decode concurrency, D session residency, and post-sticky-D-hit queueing; this is the main gain.
- With more D, kvcache-centric is theoretically more valuable, but only if admission is correct, session-cap does not over-reject, and the direct path is stable. None of that holds yet.

Optimization conclusions

I have added an ablation switch for seed admission, touching src/agentic_pd_hybrid/replay.py, src/agentic_pd_hybrid/benchmark.py, and src/agentic_pd_hybrid/cli.py, and filled in src/agentic_pd_hybrid/profile.py. uv run python -m compileall src passes.

Feasible optimizations:

1. Multi-turn-aware seed admission.
   Ali normalized oracle ablation: PD mean 19.643s, old kvcache 21.390s, oracle multi-turn seed 16.417s. The direction works, but production cannot use an oracle; we need trace/session features to predict which sessions will go multi-turn (see the sketch after this block).
2. Redo D session-cap/admission.
   In the current full micro, kvcache direct almost never triggers, mainly because of session-cap/admission. Admission for "seed/reseed or not" must be separated from "direct-to-D or not", and eviction should weigh resident tokens, last access, and expected future turns.
3. Make the P backup policy explicit.
   Experiment separately with:
   - P releases KV immediately after a successful seed
   - keep a P backup only for hot sessions
   - use the P backup only for fallback/reseed
   Otherwise we cannot answer how much P cache hits actually contribute.
4. Multi-D kvcache path stability.
   Both worker and router admission hit 500/connection-style errors under accelerated multi-D. This must be fixed first; otherwise the success-only mean is unconvincing.
5. Topology policy.
   Under the current 8-GPU constraint, prioritize D count. For the agentic coding workload, the next set worth testing is 1P3D, 2P3D, 2P4D, to see when D scaling saturates; P should be only the minimal resource that keeps prefill/transfer from clogging.

Final judgment: kvcache-centric's gains should come from "less KV transfer + D-side session reuse + avoided re-prefill", not P-node cache hits. The current implementation does not yet deliver this mechanism reliably; next, fix admission/session-cap/direct path first, then run the P backup ablation.
```
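For the multi-turn-aware seed admission in point 1 above, a minimal sketch of a feature-based predictor that would replace the oracle. The features, weights, and threshold are hypothetical placeholders, not the actual src/agentic_pd_hybrid/replay.py logic:

```python
# Hypothetical heuristic: admit a turn1 request as a D-side seed only if the
# session looks likely to go multi-turn. Features and weights are illustrative.
from dataclasses import dataclass

@dataclass
class SessionFeatures:
    input_tokens: int        # turn1 prompt length
    has_tool_calls: bool     # tool-using sessions tend to run many turns
    client_tag: str          # e.g. "coding-agent" vs "one-shot-chat"

def predict_multi_turn(f: SessionFeatures) -> float:
    """Return a rough score for whether this session will reach turn2+."""
    score = 0.0
    if f.has_tool_calls:
        score += 0.5                 # tool loops imply follow-up turns
    if f.input_tokens > 8_000:
        score += 0.3                 # large contexts suggest agent sessions
    if f.client_tag == "coding-agent":
        score += 0.2
    return min(score, 1.0)

def admit_seed(f: SessionFeatures, threshold: float = 0.5) -> bool:
    # Seed KV on D only when a later direct-to-D turn is likely to repay it.
    return predict_multi_turn(f) >= threshold
```

The oracle ablation (16.417s vs PD's 19.643s) bounds what a perfect predictor could buy; any heuristic or learned predictor will land somewhere between the old kvcache 21.390s and that oracle number.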
```
• I reran the 10-minute normalized benchmark on the sampled Ali shard at outputs/ali/normalized/ali-10min-sampled-10k-append1k-out1k.jsonl. The two completed summaries are
outputs/ali/live-kvcache-centric-ali-10min-norm-10k1k1k/kvcache-centric-sticky-20260424T042535Z/request-metrics.jsonl.summary.json and
outputs/ali/live-pd-disaggregation-ali-10min-norm-10k1k1k/pd-disaggregation-sticky-20260424T044951Z/request-metrics.jsonl.summary.json.

On this workload, pd-disaggregation is better than the current router-managed kvcache-centric:

- Mean E2E: 19.64s vs 21.39s for kvcache-centric
- Mean TTFT: 1.15s vs 2.69s
- Mean TPOT: 0.01849s vs 0.01870s
- P90 E2E: 34.80s vs 35.30s
- P99 E2E is basically tied: 35.56s vs 35.53s

The important part is why. The direct-to-D path itself is good, but it is not used enough:

- kvcache-direct-to-d-session happened on only 43 requests
- Those 43 requests were actually better than pd-disaggregation on the same request IDs:
  - mean E2E delta: -0.374s
  - mean TTFT delta: -0.477s
- But 55 requests took pd-router-fallback-large-append, and those were much worse than pd-disaggregation:
  - mean E2E delta: +4.445s
  - mean TTFT delta: +4.212s

So the current loss is not "direct-to-D is bad". It is "too few requests get the good path, and too many fall back expensively".

A few concrete observations:

- This shard has 142 requests, 41 sessions, and only 6 multi-turn sessions.
- Only 2 of those 6 multi-turn sessions actually stayed on the direct-to-D path.
- pd-disaggregation still used the prefix cache on the P-node heavily:
  - cache_hit_request_count=139
  - mean cached tokens 24924.5
- kvcache-centric reduced actual KV transfer from 25513 blocks to 21930 blocks, saving 3583 blocks, but that was not enough to offset the worse TTFT.

The most actionable diagnosis is that the router's direct-to-D decision is still too conservative or too misaligned with the actual cache state. In the bad sessions, the run labeled many requests as fallback-large-append even when the P-side observed cache hit covered almost the whole prompt. That means the router-side estimate of "the uncached suffix is small enough" does not match what the workers can really reuse.

What makes kvcache-centric better:

- The decode node already has the session KV resident.
- The uncached suffix is truly small.
- The request can bypass P entirely.
- Decode-side queueing stays controlled.

What makes it worse:

- The direct-hit rate is low.
- Requests fall back after router-side misclassification.
- The decode node takes extra append-prefill pressure without enough saved transfer/prefill cost.
- pd-disaggregation already gets strong P-side prefix reuse, so the gap to beat is smaller than "full re-prefill".

The next optimizations I would do are:

- Make direct-to-D eligibility depend on the worker-observed reusable prefix, not only the router's session-length heuristic (see the sketch after this block).
- Preserve decode residency/stickiness better once a session is seeded on D.
- Avoid classifying requests as "large append" when the actual uncached suffix is still small after radix reuse.

If you want, I can move directly to patching the routing criterion using worker-observed prefix-hit information and rerun this same normalized benchmark.
```
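A minimal sketch of that routing change, assuming workers can report an observed reusable-prefix length per session; the field names, cutoff, and routing labels other than the fallback label above are hypothetical, not the actual router API:

```python
# Hypothetical routing criterion: route direct-to-D based on the
# worker-observed reusable prefix instead of a session-length heuristic.
# `worker_prefix_hit_tokens` is an assumed per-session observation that the
# real system would need to surface from the D worker.
from dataclasses import dataclass

@dataclass
class RouteInput:
    prompt_tokens: int             # full prompt length of this turn
    worker_prefix_hit_tokens: int  # prefix the D worker says it can reuse
    session_on_d: bool             # session KV currently resident on D

MAX_UNCACHED_SUFFIX = 2_048        # illustrative cutoff, needs tuning

def route(req: RouteInput) -> str:
    if not req.session_on_d:
        return "pd-router"                       # nothing to reuse on D
    uncached = req.prompt_tokens - req.worker_prefix_hit_tokens
    if uncached <= MAX_UNCACHED_SUFFIX:
        return "direct-to-d"                     # small suffix: skip P entirely
    return "pd-router-fallback-large-append"     # big suffix: let P prefill it

# The 55 bad fallbacks above were cases where a length heuristic said
# "large append" even though `uncached` was actually small.
```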
• I fixed the mechanism in two places.

- src/agentic_pd_hybrid/replay.py now uses live decode load from /v1/loads instead of only the static max_total_num_tokens, and it distinguishes direct reuse from a fresh seed/reseed.
- third_party/sglang/python/sglang/srt/disaggregation/decode.py no longer crashes on multiple reused-session requests in one decode-local extend batch.

Measured Results

Baseline pd-disaggregation:

- Mean TTFT 9.32s
- Mean TPOT 26.11 ms/token
- Mean E2E 35.44s
- Actual KV transfer 20004 blocks

The best measured kvcache-centric so far is auto4:

- Summary: outputs/micro/live-kvcache-centric-smallappend-12s-16t-20gap-rdma-auto4/kvcache-centric-sticky-20260423T125640Z/request-metrics.jsonl.summary.json
- Mean TTFT 5.44s
- Mean TPOT 25.45 ms/token
- Mean E2E 31.83s
- Actual KV transfer 6670 blocks
- Direct-to-D requests: 160/180 reuse turns
- This is about 10.2% better E2E than pd-disaggregation
- But decode still hit "KV cache pool is full" twice, so it was faster but not at a safe operating point

The measured conservative live-load gate is auto5:

- Summary: outputs/micro/live-kvcache-centric-smallappend-12s-16t-20gap-rdma-auto5/kvcache-centric-sticky-20260423T162904Z/request-metrics.jsonl.summary.json
- Mean TTFT 23.36s
- Mean TPOT 19.05 ms/token
- Mean E2E 42.41s
- Actual KV transfer 15589 blocks
- Direct-to-D requests: only 53/180
- Stable, but worse than pd-disaggregation

A tuned split-gate version is now implemented in src/agentic_pd_hybrid/replay.py. Its validation run auto6 is still draining, so I am not claiming results for it yet.

Serious Analysis

kvcache-centric is better than pd-disaggregation when:

- Turn2+ has a very small append and a very large reused prefix.
- The decode node already has the session KV.
- Decode has enough free KV blocks and is not already carrying a transfer/prealloc backlog.
- In that regime, it avoids both P-side append-prefill and P->D KV transfer.
- That is exactly why auto4 won: 160 turns went direct, KV transfer dropped by 66.7%, and turn2-4 TTFT stayed around 0.19s.

kvcache-centric becomes worse than pd-disaggregation when:

- The decode node is the bottleneck.
- Direct append-prefill consumes decode KV and decode compute at the same time.
- Some requests still go direct while others fall back to PD.
- That mixed regime is the worst case: direct requests occupy decode, then fallback requests pay the full PD transfer cost on top of an already busy decode.
- That is what auto5 shows. It avoided pool-full events, but it rejected too many direct hits. Late turns then became mostly PD fallback with very large cached contexts, so TTFT exploded:
  - turn9-12 mean TTFT 33.43s
  - turn13-16 mean TTFT 53.03s
- In auto5, the bad tail is not the direct requests. It is the fallback traffic:
  - pd-router-fallback-d-backpressure mean TTFT 39.18s
  - pd-router-fallback-no-d-capacity mean TTFT 32.60s
- So the real problem is not "direct-to-D is slow". The problem is "once D gets too full, fallback happens too late, and the fallback requests are then huge".

Why kvcache-centric can beat PD in principle:

- It removes the extra RDMA KV hop on reuse turns.
- It avoids re-prefilling the whole reused prefix on P.
- It reduces TTFT sharply when the append is small.
- It can reduce P-node pressure for multi-turn agentic traces.

Why it can lose in principle:

- D has finite KV capacity, while session lengths keep growing every turn.
- D-local append-prefill is not free; it competes with decode for the same GPU.
- If eviction is naive, you either keep too many stale sessions or evict useful ones and reseed later.
- If admission is too loose, you hit decode retraction.
- If admission is too conservative, you lose the direct-hit advantage and revert to slow PD fallback.

What To Optimize Next

- Use benefit-cost admission, not a binary threshold (a sketch of the split gate with hysteresis follows this list).
  - Go direct only if saved KV transfer + saved P queueing exceeds added decode-prefill cost + added decode-pressure risk.
- Keep separate rules for direct reuse and seed/reseed.
  - This is now coded in src/agentic_pd_hybrid/replay.py, because a seed is much more expensive than direct reuse.
- Add hysteresis.
  - Stop new seeds earlier.
  - Stop direct reuse only near true danger.
  - Resume only after decode load falls well below the stop threshold.
- Make the append threshold dynamic.
  - Low load: allow appends up to 2k.
  - Medium load: maybe 512-1k.
  - High load: 0, force PD.
- Replace plain LRU eviction with reuse-aware eviction.
  - Score by recent access, current prefix size, and expected future turns.
  - Evict low-future-value sessions first.
- Prefer adding D capacity over P capacity for this workload.
  - The late-turn problem here is decode KV pressure, not P throughput.
  - More P helps turn1-heavy traces.
  - More D helps reuse-heavy multi-turn traces like this one.
- If scaling nodes, use kvcache-centric only on sessions whose decode home is not near pressure.
  - Otherwise route them back to PD early, before the context becomes huge.
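A minimal sketch of that policy, assuming a decode-load signal in [0, 1] derived from /v1/loads; the thresholds and class shape are illustrative, not the actual replay.py implementation:

```python
# Hypothetical split-gate admission with hysteresis: seeds stop early,
# direct reuse stops only near real danger, and each gate reopens only
# after load drops well below its stop level. All thresholds illustrative.
SEED_STOP, SEED_RESUME = 0.60, 0.45        # seeds are expensive: back off early
DIRECT_STOP, DIRECT_RESUME = 0.90, 0.75    # direct reuse is cheap: back off late

class SplitGate:
    def __init__(self):
        self.seed_open = True
        self.direct_open = True

    def update(self, decode_load: float) -> None:
        # Hysteresis: an open gate closes at STOP; a closed gate reopens
        # only below RESUME, so the policy does not flap at one threshold.
        self.seed_open = decode_load < (SEED_STOP if self.seed_open else SEED_RESUME)
        self.direct_open = decode_load < (DIRECT_STOP if self.direct_open else DIRECT_RESUME)

    def max_append(self, decode_load: float) -> int:
        # Dynamic append threshold: shrink the direct-path budget under load.
        if decode_load < 0.5:
            return 2048
        if decode_load < 0.8:
            return 512
        return 0                            # force PD fallback

    def admit(self, is_seed: bool, append_tokens: int, decode_load: float) -> bool:
        self.update(decode_load)
        if is_seed:
            return self.seed_open
        return self.direct_open and append_tokens <= self.max_append(decode_load)
```

The point of the split is exactly the auto4/auto5 contrast above: one shared threshold either admits too much (auto4's pool-full events) or rejects too much (auto5's fallback tail).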
Bottom Line

- The best measured performance result right now is still auto4, but it is not safe enough.
- The best measured safe result is auto5, but it is too conservative and slower than baseline PD.
- The correct direction is neither "always direct" nor "always back off".
- The correct policy is "direct reuse aggressively while D is healthy, but stop seeding much earlier than you stop direct reuse".
- That split policy is now implemented in src/agentic_pd_hybrid/replay.py. Its validation run is still in progress.
| Setup | Total VRAM | Model weights | Space left for KVCache (theoretical max) | KV size per token | Storable tokens (theoretical max) | Storable blocks (block size = 16) | Storable blocks (block size = 32) |
| -------------------------------------- | --------: | -------: | --------------------: | --------------------------------: | ----------------: | -------------------------: | -------------------------: |
| **GLM-5.1-FP8 @ 8×H20 141GB** | $1128$ GB | $756$ GB | $372$ GB | $89{,}856$ B $\approx 87.75$ KiB | $\approx 4.14$ M | $\approx 258.7$ K | $\approx 129.4$ K |
| **Qwen3-Coder-480B-FP8 @ 8×H20 141GB** | $1128$ GB | $482$ GB | $646$ GB | $253{,}952$ B $\approx 248.0$ KiB | $\approx 2.54$ M | $\approx 159.0$ K | $\approx 79.5$ K |
| **GLM-5.1-FP8 @ 8×B300 288GB** | $2304$ GB | $756$ GB | $1548$ GB | $89{,}856$ B $\approx 87.75$ KiB | $\approx 17.23$ M | $\approx 1076.7$ K | $\approx 538.4$ K |
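The capacity columns follow directly from the free space and per-token KV size, with GB read as 10^9 bytes (that convention reproduces the table exactly). A quick check:

```python
# Reproduce the table's theoretical KV-capacity numbers.
GB = 10**9  # the table's GB are decimal gigabytes

setups = {
    # name: (bytes left for KV cache, KV bytes per token)
    "GLM-5.1-FP8 @ 8xH20 141GB":          (372 * GB,  89_856),
    "Qwen3-Coder-480B-FP8 @ 8xH20 141GB": (646 * GB, 253_952),
    "GLM-5.1-FP8 @ 8xB300 288GB":        (1548 * GB,  89_856),
}

for name, (kv_bytes, per_token) in setups.items():
    tokens = kv_bytes / per_token
    print(f"{name}: {tokens/1e6:.2f}M tokens, "
          f"{tokens/16/1e3:.1f}K blocks@16, {tokens/32/1e3:.1f}K blocks@32")

# GLM-5.1-FP8 @ 8xH20 141GB: 4.14M tokens, 258.7K blocks@16, 129.4K blocks@32
# Qwen3-Coder-480B-FP8 @ 8xH20 141GB: 2.54M tokens, 159.0K blocks@16, 79.5K blocks@32
# GLM-5.1-FP8 @ 8xB300 288GB: 17.23M tokens, 1076.7K blocks@16, 538.4K blocks@32
```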
Workload plots:

Basics:
1. CDF of input/output length, plus a CDF zoomed into the bottom 80%; annotate mean/p50/p80/p90/p95/p99 under the input and output length figures respectively (see the plotting helper after this list).

Session-related:
2. CDF of session turns, with a zoom into the top 10% of turns; likewise annotate mean/p50/p90/p95/p99 turns under the figure.
3. Line chart of mean/p50/p99 request length versus turn index (grouping all requests with the same turn count into one class for the computation).

Tool-related:
4. Pie chart of the share of requests triggered by each role (tool/user/...), with percentages and counts labeled.
5. CDF of tool-call output length, likewise annotated with mean/p50/p90/p95/p99.
6. CDF of tool-call latency (the interval from the tool-call request returning to the tool-call result triggering the next turn), with a 90% zoom-in, likewise annotated with mean/p50/p90/p95/p99 under the figure.
7. CDF of the number of consecutive tool calls after one user input, likewise annotated with mean/p50/p90/p95/p99 under the figure.
8. CDF of the new context added by a tool-call-triggered next turn in a session (input length relative to the previous turn's output length), likewise with mean/p50/p90/p95/p99.

KVCache-related:
9. CDF of kvcache block reuse time (for the same kvcache block, each reuse's interval since the previous reuse or first creation counts as one sample), with a 90% zoom-in, likewise annotated with mean/p50/p90/p95/p99 under the figure.
10. CDF of kvcache block lifespan (the interval from first appearance to last reuse), likewise annotated with mean/p50/p90/p95/p99.
11. Line chart of the live kvcache block count over time, with one x-axis tick per 10 minutes.

Bucketing-related:
12. Per-bucket kvcache reuse ratio when each bucket's requests may only reuse prefix kvcache within the same bucket, plus the overall unbucketed reuse ratio across all requests, as a bar chart, labeling the total reused blocks per bucket and unbucketed.
13. The kvcache misses caused by a session's consecutive turns crossing buckets, and the lost reused blocks as a share of the corresponding bucket's total kvcache reuse.
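A minimal helper for the CDF items above, using matplotlib/numpy only: a full CDF plus a zoomed panel, with the requested stats line placed under the figure. The function name, quantile choice, and data source are up to the caller:

```python
# Sketch of the CDF-with-zoom plots requested above.
import numpy as np
import matplotlib.pyplot as plt

def plot_cdf(values, title, zoom_q=0.8, percentiles=(50, 80, 90, 95, 99)):
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)

    fig, (ax_full, ax_zoom) = plt.subplots(1, 2, figsize=(10, 4))
    ax_full.plot(x, y)
    ax_full.set_title(title)
    # Zoom panel: only the bottom `zoom_q` fraction of the distribution.
    cut = int(len(x) * zoom_q)
    ax_zoom.plot(x[:cut], y[:cut])
    ax_zoom.set_title(f"{title} (bottom {zoom_q:.0%})")

    stats = " / ".join([f"mean={x.mean():.1f}"] +
                       [f"p{p}={np.percentile(x, p):.1f}" for p in percentiles])
    fig.supxlabel(stats)   # stats line under the figure, as the spec asks
    fig.tight_layout()
    return fig

# Example: plot_cdf(input_lengths, "input length").savefig("input_len_cdf.png")
```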
service_name = 33e6d810.deployment.qwen-max-mainline-chat-* and response_params not null
```json
{
  "kv-transfer-config": {
    "kv_connector": "DecodeBenchConnector",
    "kv_role": "kv_both"
  }
}
```
DS_LLM_DISAGGREGATED_DECODE_NODE
enable_disaggregated_prefilling
kv_transfer_config
DS_LLM_FORCED_MAX_NEW_TOKENS=1
DS_LLM_FORCED_MAX_NEW_TOKENS

DASHGEN_DEPLOYMENT_ROLE=prefill
```
# Prefill role: disaggregated prefilling enabled, HybridConnector as kv_producer.
export DASH_GEN_ARGS=${DASH_GEN_ARGS:-'{"enable_disaggregated_prefilling": 1, "enable_think": 0, "think_mode": "auto", "gpu_memory_utilization": 0.85, "max_model_len": 262144, "enable_chunked_prefill": true, "speculative_config": {"method": "eagle3", "num_speculative_tokens": 1, "hf_overrides": {"rope_scaling": {"type": "yarn", "factor": 128, "original_max_position_embeddings": 2048, "semi_dynamic": false, "dynamic": true}, "num_experts": 0}, "model": "/home/admin/resource/model/464482ce.qwen3-235b-a22b/0717-eagle-0820"}, "swap_space": 0, "enable_prefix_caching": true, "max_num_batched_tokens": 8192, "enable_expert_parallel": false, "disable_hybrid_kv_cache_manager": true, "block_size": 64, "max_num_seqs": 64, "enforce_eager": false, "quantization": "fp8", "cuda_graph_sizes": [16, 32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800, 832, 864, 896, 928, 960, 992, 1024], "compilation_config": {"cudagraph_mode": "PIECEWISE", "use_inductor": false, "custom_ops": ["all"], "max_cudagraph_capture_size": 2048}, "kv_transfer_config": {"kv_connector": "HybridConnector", "kv_role": "kv_producer", "kv_connector_extra_config": {"backend": "kvt"}}, "hf_overrides": {"architectures": ["Qwen3MoeForCausalLM"], "model_type": "qwen3_moe"}, "kv_cache_dtype": "fp8"}'}

# Non-disaggregated baseline: same engine args, disaggregated prefilling off, empty kv_transfer_config.
export DASH_GEN_ARGS=${DASH_GEN_ARGS:-'{"enable_disaggregated_prefilling": false, "enable_think": 0, "think_mode": "auto", "gpu_memory_utilization": 0.85, "max_model_len": 262144, "enable_chunked_prefill": true, "speculative_config": {"method": "eagle3", "num_speculative_tokens": 1, "hf_overrides": {"rope_scaling": {"type": "yarn", "factor": 128, "original_max_position_embeddings": 2048, "semi_dynamic": false, "dynamic": true}, "num_experts": 0}, "model": "/home/admin/resource/model/464482ce.qwen3-235b-a22b/0717-eagle-0820"}, "swap_space": 0, "enable_prefix_caching": true, "max_num_batched_tokens": 8192, "enable_expert_parallel": false, "disable_hybrid_kv_cache_manager": true, "block_size": 64, "max_num_seqs": 64, "enforce_eager": false, "quantization": "fp8", "cuda_graph_sizes": [16, 32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800, 832, 864, 896, 928, 960, 992, 1024], "compilation_config": {"cudagraph_mode": "PIECEWISE", "use_inductor": false, "custom_ops": ["all"], "max_cudagraph_capture_size": 2048}, "kv_transfer_config": "", "hf_overrides": {"architectures": ["Qwen3MoeForCausalLM"], "model_type": "qwen3_moe"}, "kv_cache_dtype": "fp8"}'}
```
## Trace analysis for the coding agent

1. GLM5 040909-040911
2. GLM5 040915-040917
3. Qwen3-Coder 040915-040917
## Ali kvcache status

GLM 5 buckets:
0-32k
32k-85k
85k-128k
128k++

Questions:
- How does the scheduler affect the kvcache hit ratio? Does the 5min TTL cause misjudgments?
- The overhead of transferring kvcache through vineyard
- The value of vineyard's remote cache
Trace analysis:
1. CDF of input/output length
2. CDF of session turns
3. Bar chart of the shares of trigger reasons for session prev-turn -> next-turn transitions
4. CDF of kvcache block reuse time
5. CDF of kvcache lifespan
6. CDF of tool-call latency
7. Distribution of new tokens added per tool call
8. New tokens added per tool call as a share of the current context length
---
dash0: qwen235b prefill-only thinking tuning, step slo246
dash1: qwen27b chat 0-8k, tuned best vs baseline
dash2: qwen235b decode-only thinking tuning with 0323

baseline: 0414, monitoring from the 05:00-07:00 test window
---
```
Turn 1 Input:  1 2 3      In trace: 1 2 3
Turn 1 Output: 4 5 6
Turn 2 Input:  7 8        Append-only context: 1 2 3 4 5 6 7 8
                          Real context:        1 2 4 5 6 7 8
                          Hit: 1 2
Turn 2 Output: 9
Turn 3 Input:  10 11      Append-only context: 1 2 3 4 5 6 7 8 9 10 11
                          Real context:        1 2 4 5 6 7 9 10 11
                          Hit: 1 2 4 5 6 7
```
```
Turn 1 Input:  1 2 3      In trace: 1 2 3
Turn 1 Output: 90 91 92   Output in trace: 4 5 6
Turn 2 Input:  7 8        In trace: 1 2 3 4 5 6 7 8
                          Real context in engine: 1 2 3 90 91 92 7 8
                          Hit: 1 2 3
```
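The two examples above are why append-only trace replay misestimates cache hits: the engine's real context diverges from the trace's recorded context (compaction in the first example, different sampled output in the second), and only the longest common prefix can be reused. A small sketch of computing that hit, using plain ints as tokens:

```python
# Compute the prefix hit between what the replay sends (append-only context
# from the trace) and what the engine actually holds, as in the examples above.
def prefix_hit(cached: list[int], prompt: list[int]) -> list[int]:
    """Longest common prefix: the only part a radix cache can reuse."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return prompt[:n]

# Second example above: the trace recorded output (4 5 6), but the engine
# actually generated (90 91 92), so turn 2 only hits the turn-1 input.
engine_context = [1, 2, 3, 90, 91, 92]
turn2_prompt   = [1, 2, 3, 4, 5, 6, 7, 8]   # append-only context from the trace
print(prefix_hit(engine_context, turn2_prompt))  # -> [1, 2, 3]
```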
projects/agentic-kvcache/outline.md (new, empty)

projects/agentic-kvcache/related-works.md (new):
LAPS: A Length-Aware-Prefill LLM Serving System
https://arxiv.org/abs/2601.11589

Understanding Bottlenecks for Efficiently Serving LLM Inference With KV Offloading
https://arxiv.org/abs/2601.19910

DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference
https://arxiv.org/abs/2602.21548

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
https://www.usenix.org/system/files/osdi24-zhong-yinmin.pdf

Not All Prefills Are Equal: PPD Disaggregation for Multi-turn LLM Serving
https://arxiv.org/pdf/2603.13358

Efficient Multi-round LLM Inference over Disaggregated Serving
https://arxiv.org/pdf/2602.14516v1

Roadmap: SGLang Distributed KVCache System For Agentic Workload
https://github.com/sgl-project/sglang/issues/21846

RFC: P/D Prefill compute optimizations with bi-directional KV cache transfers between P and D nodes
https://github.com/vllm-project/vllm/issues/32733

- [Not All Prefills Are Equal: PPD Disaggregation for Multi-turn LLM Serving](https://arxiv.org/html/2603.13358v1)
- [SWE-AGI: Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents](https://arxiv.org/html/2602.09447v1)
- [DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference](https://arxiv.org/html/2602.21548v1)
- [Prefill/Decode Disaggregation | llm-d](https://llm-d.ai/docs/guide/Installation/pd-disaggregation)
- [CacheSlide: Unlocking Cross Position-Aware KV Cache Reuse for Accelerating LLM Serving | USENIX](https://www.usenix.org/conference/fast26/presentation/liu-yang)
projects/agentic-kvcache/roadmap.md (new, empty)

projects/agentic-kvcache/sync.md (new):
| Bucket | Requests | SLA | Instances | mean TTFT (estimated_ttft) | Infinite-capacity ceiling |
| ------- | ------: | -----: | --: | -----------------------: | -----: |
| 0-32k | 637,142 | <= 5s | 64 | 0.502s | 59.63% |
| 32-85k | 99,735 | <= 10s | 48 | 2.801s | 82.35% |
| 85-128k | 23,624 | <= 15s | 16 | 9.669s | 84.25% |
| 128k+ | 3,226 | <= 20s | 6 | 9.572s | 82.99% |

| Bucket | Best policy | Best TTFT / Hit / Gap | cache_score | cache_score_strong |
| ------- | ------------------------ | ------------------------ | ------------------------- | ------------------------- |
| 0-32k | cache_affinity_weak_rend | 0.488s / 56.11% / 3.52pp | 0.536s / 54.45% / 5.18pp | 0.813s / 56.97% / 2.66pp |
| 32-85k | estimated_ttft | 2.801s / 76.70% / 5.66pp | 3.766s / 77.52% / 4.83pp | 5.193s / 78.00% / 4.35pp |
| 85-128k | cache_affinity_weak_rend | 9.289s / 77.12% / 7.13pp | 9.408s / 77.07% / 7.18pp | 11.906s / 76.87% / 7.38pp |
| 128k+ | estimated_ttft | 9.572s / 74.44% / 8.54pp | 10.630s / 74.56% / 8.42pp | 11.481s / 74.39% / 8.59pp |

cache_score_strong does not win on Qwen3. It gets a slightly higher hit ratio only on 0-32k and 32-85k, and at the cost of worse TTFT; on 85-128k and 128k+ it has no hit-ratio advantage at all and still worse TTFT. In other words, chasing cache more aggressively on Qwen3 does not buy a stable gain.

cache_score is steadier than cache_score_strong. In all four buckets it has better TTFT than cache_score_strong; its hit ratio is close to cache_score_strong's and even better in the long buckets. If choosing only between cache_score and cache_score_strong, prefer cache_score on Qwen3.

No single policy is best across the board. cache_affinity_weak_rend wins on 0-32k and 85-128k, while estimated_ttft wins on 32-85k and 128k+. This means no single policy dominates all length ranges on Qwen3, and differentiated per-bucket routing is worthwhile.

Judging by the gaps, the real problem is not eviction but the workload ceiling itself and the online placement policy. The 0-32k ceiling is so low that no amount of online-routing tuning can move much past roughly 60%; the medium and long buckets have high ceilings, but the current best online policies still trail the infinite-capacity ceiling by 5.7pp to 8.5pp, which means there is routing/placement headroom left, just not via cache_score_strong.