Initial commit: obsidian to gitea
projects/agentic-kvcache/main.md (new, empty)

projects/agentic-kvcache/ongoing.md (new):
```
• I have finished the runs and organized the results. The conclusion is fairly clear: the current P-side priority-eviction implementation brings no measurable benefit. Your expectation is mechanically correct, but under the current workload and trigger conditions there is no effective surface for it to pay off.

Micro 2P2D results

Same trace: 316 requests / 58 sessions / 2P2D / Mooncake / no-stream.

| Group | ok/error | mean | p50 | p90 | p99 | direct-to-D | actual KV blocks | Notes |
|---|---:|---:|---:|---:|---:|---:|---:|---|
| PD baseline | 316/0 | 22.879s | 25.565s | 29.026s | 30.642s | 0 | 86019 | stable |
| release, threshold=3 | 316/0 | 22.937s | 25.617s | 29.605s | 32.457s | 0 | 86019 | stable, but no direct-to-D benefit |
| priority, threshold=3 | 316/0 | 22.897s | 25.801s | 29.393s | 32.622s | 0 | 86019 | only 6 low-priority P requests |
| release, threshold=4 | 314/2 | 21.605s | 25.078s | 28.498s | 30.143s | 12 | 82892 | saves KV, but not stable |
| priority, threshold=4 | 314/2 | 21.743s | 25.286s | 29.094s | 30.561s | 12 | 82892 | priority gives no improvement, slightly slower |

priority-group output paths:
outputs/live-priority-eviction/...
outputs/live-priority-eviction-thr4/...

Ali filtered results

Generated filtered trace:
outputs/ali-filtered-smallappend-priority-test.jsonl

Scale: 81 requests / 28 sessions / 53 turn2+ / max input 18901 / max output 1925 / span 5414s.

PD baseline completed:

- ok/error: 81/0
- mean: 9.072s
- p50: 7.086s
- p90: 21.761s
- p99: 26.813s
- actual KV blocks: 460

But kvcache-centric failed to converge twice on Ali filtered:

- threshold=4: stuck after about 58/81 /generate 200 OK responses, GPU at 0%, no metrics
- threshold=3: stuck after about 62/81 /generate 200 OK responses, GPU at 0%, no metrics

I aborted both runs rather than waiting out the 1200s/3600s timeouts. This says that on Ali filtered, the current kvcache-centric direct/session mechanism itself is not yet stable, so the priority-eviction ablation cannot stand.

Does this match expectations?

Partially, but it did not translate into a performance gain.

Your hypothesis was: after a P->D transfer, P can preferentially release/evict that session's prefix cache and thereby improve P-side prefix reuse. The hypothesis holds at the mechanism level, but the effect is invisible in these experiments, because:

1. release-after-transfer already removes the explicit duplication between the P streaming-session backup and the D session copy.
2. The new priority eviction only changes the eviction order of the P radix/prefix cache, so it only matters when the P radix cache is under real pressure.
3. To stay stable, micro threshold=3 almost disables direct-to-D; only 6 turn1 seeds are low-priority, so the trigger surface is tiny.
4. Micro threshold=4 gets 12 direct-to-D requests and cuts KV blocks from 86019 to 82892, but also produces 2 errors, and priority itself neither reduces KV further nor improves cache hit.
5. SGLang radix priority propagates max(priority) along shared prefixes; if a low-priority direct session shares a prefix with a high-priority normal session, the high priority protects the shared nodes, and the low priority alone cannot get the shared prefix evicted (see the sketch after this block).

Next optimization steps

Priority should be upgraded from "just tagging P radix eviction" into a fuller admission/eviction mechanism:

1. First fix the Ali kvcache-centric convergence problem: give direct-to-D / seed / close_session a shorter per-request timeout plus a failure fallback, so replay cannot deadlock.
2. Record real P-side radix evictions / cached tokens / token usage, instead of only request-level cached_tokens.
3. Change the direct-to-D predictor from "is this a seed" to "is it worth letting P keep the prefix": only sessions expected to keep hitting P get high priority.
4. Make P radix-cache demotion session-aware; request priority alone is not enough, because shared radix nodes get polluted by high-priority prefixes.
5. Under multi P/D, the P low-priority policy must stay consistent with the D sticky/direct policy; otherwise P releases session A and a later request gets routed back to P for a large prefill, which makes things worse.
```
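Point 5 above is the subtle one. A minimal sketch of why max(priority) propagation on shared radix prefixes blocks low-priority eviction; the node structure and method names here are illustrative, not SGLang's actual radix-tree code:

```python
# Hypothetical radix-tree sketch: priorities propagate as max() along shared
# prefixes, so a low-priority session cannot force eviction of nodes it
# shares with a high-priority session.
from dataclasses import dataclass, field

@dataclass
class Node:
    token: tuple
    priority: int = 0            # effective priority = max over all users
    children: dict = field(default_factory=dict)

    def insert(self, tokens, priority):
        self.priority = max(self.priority, priority)  # max-propagation
        if not tokens:
            return
        head, rest = tokens[0], tokens[1:]
        child = self.children.setdefault(head, Node((head,)))
        child.insert(rest, priority)

    def evictable(self, cutoff):
        # Evictable only if *every* user is below the cutoff; one
        # high-priority session protects the whole shared prefix.
        return self.priority < cutoff

root = Node(())
root.insert((1, 2, 3, 7, 8), priority=0)   # low-priority direct session
root.insert((1, 2, 3, 9, 9), priority=5)   # high-priority normal session
shared = root.children[1]
print(shared.priority, shared.evictable(cutoff=3))  # -> 5 False: prefix protected
```

This is why item 4 above argues for session-aware demotion: the eviction decision has to look at which sessions still need a node, not just the max priority stamped on it.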
```
• I have run a more complete multi P/D sweep and analyzed session gaps, where the gains actually come from, and which optimizations are feasible. Conclusion first: we currently cannot show that kvcache-centric stably beats PD-disaggregation on the full workload; the real gains come from D-side session/KV reuse, not from P-node cache hits. In the larger experiments, scaling D clearly matters more than scaling P.

Experimental setup

Uses the more complete synthetic micro trace: outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl

- 316 requests, 58 sessions, all multi-turn
- turn counts: 2 to 10, averaging 5.45 turns/session
- initial input 30k tokens, later appends of 1k tokens, outputs of 256 tokens
- raw duration 1322s; replay uses --time-scale 50, so runtime session gaps shrink by about 50x
- no-stream, concurrency-limit 1000; valid for internal side-by-side comparison, but do not mix it directly with earlier streaming results

Session-gap comparison

Full micro:

| trace | requests | sessions | multi-turn sessions | inter-turn gap p50 | p90 | p99 |
|---|---:|---:|---:|---:|---:|---:|
| full micro raw | 316 | 58 | 58 | 100.5s | 123.0s | 129.4s |
| full micro replay-equivalent | 316 | 58 | 58 | 2.01s | 2.46s | 2.59s |
| Ali filtered small | 24 | 7 | 7 | 2.42s | 4.65s | 10.02s |
| Ali normalized 10min | 142 | 41 | 6 | 4.26s | 8.50s | 379.4s |

This shows that the new full micro's replayed gaps are already close to Ali filtered small, but it is more idealized than Ali normalized: all of its sessions are multi-turn, whereas Ali normalized has only 6/41 multi-turn sessions, and the many turn1 seeds pollute the KV/cache budget.

PD-disaggregation multi P/D results

All PD runs had 0 errors.

| topology | mean E2E | p50 | p90 | p99 | Conclusion |
|---|---:|---:|---:|---:|---|
| 1P1D | 44.125s | 50.473s | 56.743s | 58.872s | baseline |
| 2P1D | 42.801s | 49.016s | 54.925s | 57.446s | adding P only; tiny gain, about 3.0% |
| 1P2D | 29.641s | 29.261s | 45.614s | 53.433s | adding D only; clear gain, about 32.8% |
| 2P2D | 22.879s | 25.563s | 29.023s | 30.612s | best, about 48.2% |

This data is more trustworthy than the 12-request micro. Under this workload, D-node count matters more than P-node count, because E2E is dominated by decode concurrency, D-side queueing, and D-side session residency.

kvcache-centric results

The mean over successful requests sometimes looks lower, but there are many errors, so it cannot count as a valid win.

| admission | topology | errors | successful mean | direct-to-D | Main problem |
|---|---:|---:|---:|---:|---|
| worker | 1P1D | 0 | 45.333s | 0 | worse than PD 1P1D |
| worker | 2P1D | 1 | 43.471s | 0 | adding P only is still limited |
| worker | 1P2D | 33 | 25.109s | 0 | success-only mean is biased; too many errors |
| worker | 2P2D | 32 | 19.733s | 1 | unstable; cannot count as a win |
| router | 1P2D | 58 | 23.536s | 0 | even more kvcache-path errors |
| router | 2P2D | 38 | 20.185s | 0 | still unstable |

Key observation: the multi-D kvcache-centric success means are low because failed requests are excluded; at the same time direct-to-D = 0/1, which means we are not yet stably exercising the mechanism we actually want to validate: KV already resident on D, later turns hitting D directly. Most requests land in pd-router-fallback-large-append-session-cap, i.e. they are bounced back to the PD router by session-cap/admission.

Are more P-nodes better?

The current data does not support the conclusion that more P-nodes help.

- Under PD, going from 1P1D to 2P1D only improves 44.125s -> 42.801s.
- Under PD, going from 1P1D to 1P2D improves 44.125s -> 29.641s.
- kvcache-centric's target path is direct-to-D, which bypasses P on a hit, so P-node cache hits are not the main source of gains.
- In the current implementation P may keep prefill backup KV, so the earlier kvcache-centric gains mix D reuse, reduced KV transfer, P backup duplication, and fallback/reseed; they cannot be explained as pure P cache hits.

For xP + yD:

- Fixing yD and only adding P helps only when prefill is clearly congested; that was not the main bottleneck here.
- Fixing xP and adding D directly relieves decode concurrency, D session residency, and post-sticky-D-hit queueing; this is the main gain.
- With more D, kvcache-centric is theoretically more valuable, but only if admission is correct, session-cap does not over-reject, and the direct path is stable. None of that holds yet.

Optimization conclusions

I have added an ablation switch for seed admission, touching src/agentic_pd_hybrid/replay.py, src/agentic_pd_hybrid/benchmark.py, and src/agentic_pd_hybrid/cli.py, and filled in src/agentic_pd_hybrid/profile.py. uv run python -m compileall src passes.

Feasible optimizations:

1. Multi-turn-aware seed admission.
   Ali normalized oracle ablation: PD mean 19.643s, old kvcache 21.390s, oracle multi-turn seed 16.417s. The direction works, but production cannot use an oracle; we need trace/session features to predict which sessions will go multi-turn (see the sketch after this block).
2. Redo D session-cap/admission.
   In the current full micro, kvcache direct almost never triggers, mainly because of session-cap/admission. Admission for "seed/reseed or not" must be separated from "direct-to-D or not", and eviction should weigh resident tokens, last access, and expected future turns.
3. Make the P backup policy explicit.
   Experiment separately with:
   - P releases KV immediately after a successful seed
   - keep a P backup only for hot sessions
   - use the P backup only for fallback/reseed
   Otherwise we cannot answer how much P cache hits actually contribute.
4. Multi-D kvcache path stability.
   Both worker and router admission hit 500/connection-style errors under accelerated multi-D. This must be fixed first; otherwise the success-only mean is unconvincing.
5. Topology policy.
   Under the current 8-GPU constraint, prioritize D count. For the agentic coding workload, the next set worth testing is 1P3D, 2P3D, 2P4D, to see when D scaling saturates; P should be only the minimal resource that keeps prefill/transfer from clogging.

Final judgment: kvcache-centric's gains should come from "less KV transfer + D-side session reuse + avoided re-prefill", not P-node cache hits. The current implementation does not yet deliver this mechanism reliably; next, fix admission/session-cap/direct path first, then run the P backup ablation.
```
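For the multi-turn-aware seed admission in point 1 above, a minimal sketch of a feature-based predictor that would replace the oracle. The features, weights, and threshold are hypothetical placeholders, not the actual src/agentic_pd_hybrid/replay.py logic:

```python
# Hypothetical heuristic: admit a turn1 request as a D-side seed only if the
# session looks likely to go multi-turn. Features and weights are illustrative.
from dataclasses import dataclass

@dataclass
class SessionFeatures:
    input_tokens: int        # turn1 prompt length
    has_tool_calls: bool     # tool-using sessions tend to run many turns
    client_tag: str          # e.g. "coding-agent" vs "one-shot-chat"

def predict_multi_turn(f: SessionFeatures) -> float:
    """Return a rough score for whether this session will reach turn2+."""
    score = 0.0
    if f.has_tool_calls:
        score += 0.5                 # tool loops imply follow-up turns
    if f.input_tokens > 8_000:
        score += 0.3                 # large contexts suggest agent sessions
    if f.client_tag == "coding-agent":
        score += 0.2
    return min(score, 1.0)

def admit_seed(f: SessionFeatures, threshold: float = 0.5) -> bool:
    # Seed KV on D only when a later direct-to-D turn is likely to repay it.
    return predict_multi_turn(f) >= threshold
```

The oracle ablation (16.417s vs PD's 19.643s) bounds what a perfect predictor could buy; any heuristic or learned predictor will land somewhere between the old kvcache 21.390s and that oracle number.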
```
• I reran the 10-minute normalized benchmark on the sampled Ali shard at outputs/ali/normalized/ali-10min-sampled-10k-append1k-out1k.jsonl. The two completed summaries are
outputs/ali/live-kvcache-centric-ali-10min-norm-10k1k1k/kvcache-centric-sticky-20260424T042535Z/request-metrics.jsonl.summary.json and
outputs/ali/live-pd-disaggregation-ali-10min-norm-10k1k1k/pd-disaggregation-sticky-20260424T044951Z/request-metrics.jsonl.summary.json.

On this workload, pd-disaggregation is better than the current router-managed kvcache-centric:

- Mean E2E: 19.64s vs 21.39s for kvcache-centric
- Mean TTFT: 1.15s vs 2.69s
- Mean TPOT: 0.01849s vs 0.01870s
- P90 E2E: 34.80s vs 35.30s
- P99 E2E is basically tied: 35.56s vs 35.53s

The important part is why. The direct-to-D path itself is good, but it is not used enough:

- kvcache-direct-to-d-session happened on only 43 requests
- Those 43 requests were actually better than pd-disaggregation on the same request IDs:
  - mean E2E delta: -0.374s
  - mean TTFT delta: -0.477s
- But 55 requests took pd-router-fallback-large-append, and those were much worse than pd-disaggregation:
  - mean E2E delta: +4.445s
  - mean TTFT delta: +4.212s

So the current loss is not "direct-to-D is bad". It is "too few requests get the good path, and too many fall back expensively".

A few concrete observations:

- This shard has 142 requests, 41 sessions, and only 6 multi-turn sessions.
- Only 2 of those 6 multi-turn sessions actually stayed on the direct-to-D path.
- pd-disaggregation still used the prefix cache on the P-node heavily:
  - cache_hit_request_count=139
  - mean cached tokens 24924.5
- kvcache-centric reduced actual KV transfer from 25513 blocks to 21930 blocks, saving 3583 blocks, but that was not enough to offset the worse TTFT.

The most actionable diagnosis is that the router's direct-to-D decision is still too conservative or too misaligned with the actual cache state. In the bad sessions, the run labeled many requests as fallback-large-append even when the P-side observed cache hit covered almost the whole prompt. That means the router-side estimate of "the uncached suffix is small enough" does not match what the workers can really reuse.

What makes kvcache-centric better:

- The decode node already has the session KV resident.
- The uncached suffix is truly small.
- The request can bypass P entirely.
- Decode-side queueing stays controlled.

What makes it worse:

- The direct-hit rate is low.
- Requests fall back after router-side misclassification.
- The decode node takes extra append-prefill pressure without enough saved transfer/prefill cost.
- pd-disaggregation already gets strong P-side prefix reuse, so the gap to beat is smaller than "full re-prefill".

The next optimizations I would do are:

- Make direct-to-D eligibility depend on the worker-observed reusable prefix, not only the router's session-length heuristic (see the sketch after this block).
- Preserve decode residency/stickiness better once a session is seeded on D.
- Avoid classifying requests as "large append" when the actual uncached suffix is still small after radix reuse.

If you want, I can move directly to patching the routing criterion using worker-observed prefix-hit information and rerun this same normalized benchmark.
```
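A minimal sketch of that routing change, assuming workers can report an observed reusable-prefix length per session; the field names, cutoff, and routing labels other than the fallback label above are hypothetical, not the actual router API:

```python
# Hypothetical routing criterion: route direct-to-D based on the
# worker-observed reusable prefix instead of a session-length heuristic.
# `worker_prefix_hit_tokens` is an assumed per-session observation that the
# real system would need to surface from the D worker.
from dataclasses import dataclass

@dataclass
class RouteInput:
    prompt_tokens: int             # full prompt length of this turn
    worker_prefix_hit_tokens: int  # prefix the D worker says it can reuse
    session_on_d: bool             # session KV currently resident on D

MAX_UNCACHED_SUFFIX = 2_048        # illustrative cutoff, needs tuning

def route(req: RouteInput) -> str:
    if not req.session_on_d:
        return "pd-router"                       # nothing to reuse on D
    uncached = req.prompt_tokens - req.worker_prefix_hit_tokens
    if uncached <= MAX_UNCACHED_SUFFIX:
        return "direct-to-d"                     # small suffix: skip P entirely
    return "pd-router-fallback-large-append"     # big suffix: let P prefill it

# The 55 bad fallbacks above were cases where a length heuristic said
# "large append" even though `uncached` was actually small.
```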
• I fixed the mechanism in two places.

- src/agentic_pd_hybrid/replay.py now uses live decode load from /v1/loads instead of only the static max_total_num_tokens, and it distinguishes direct reuse from a fresh seed/reseed.
- third_party/sglang/python/sglang/srt/disaggregation/decode.py no longer crashes on multiple reused-session requests in one decode-local extend batch.

Measured Results

Baseline pd-disaggregation:

- Mean TTFT 9.32s
- Mean TPOT 26.11 ms/token
- Mean E2E 35.44s
- Actual KV transfer 20004 blocks

The best measured kvcache-centric so far is auto4:

- Summary: outputs/micro/live-kvcache-centric-smallappend-12s-16t-20gap-rdma-auto4/kvcache-centric-sticky-20260423T125640Z/request-metrics.jsonl.summary.json
- Mean TTFT 5.44s
- Mean TPOT 25.45 ms/token
- Mean E2E 31.83s
- Actual KV transfer 6670 blocks
- Direct-to-D requests: 160/180 reuse turns
- This is about 10.2% better E2E than pd-disaggregation
- But decode still hit "KV cache pool is full" twice, so it was faster but not at a safe operating point

The measured conservative live-load gate is auto5:

- Summary: outputs/micro/live-kvcache-centric-smallappend-12s-16t-20gap-rdma-auto5/kvcache-centric-sticky-20260423T162904Z/request-metrics.jsonl.summary.json
- Mean TTFT 23.36s
- Mean TPOT 19.05 ms/token
- Mean E2E 42.41s
- Actual KV transfer 15589 blocks
- Direct-to-D requests: only 53/180
- Stable, but worse than pd-disaggregation

A tuned split-gate version is now implemented in src/agentic_pd_hybrid/replay.py. Its validation run auto6 is still draining, so I am not claiming results for it yet.

Serious Analysis

kvcache-centric is better than pd-disaggregation when:

- Turn2+ has a very small append and a very large reused prefix.
- The decode node already has the session KV.
- Decode has enough free KV blocks and is not already carrying a transfer/prealloc backlog.
- In that regime, it avoids both P-side append-prefill and P->D KV transfer.
- That is exactly why auto4 won: 160 turns went direct, KV transfer dropped by 66.7%, and turn2-4 TTFT stayed around 0.19s.

kvcache-centric becomes worse than pd-disaggregation when:

- The decode node is the bottleneck.
- Direct append-prefill consumes decode KV and decode compute at the same time.
- Some requests still go direct while others fall back to PD.
- That mixed regime is the worst case: direct requests occupy decode, then fallback requests pay the full PD transfer cost on top of an already busy decode.
- That is what auto5 shows. It avoided pool-full events, but it rejected too many direct hits. Late turns then became mostly PD fallback with very large cached contexts, so TTFT exploded:
  - turn9-12 mean TTFT 33.43s
  - turn13-16 mean TTFT 53.03s
- In auto5, the bad tail is not the direct requests. It is the fallback traffic:
  - pd-router-fallback-d-backpressure mean TTFT 39.18s
  - pd-router-fallback-no-d-capacity mean TTFT 32.60s
- So the real problem is not "direct-to-D is slow". The problem is "once D gets too full, fallback happens too late, and the fallback requests are then huge".

Why kvcache-centric can beat PD in principle:

- It removes the extra RDMA KV hop on reuse turns.
- It avoids re-prefilling the whole reused prefix on P.
- It reduces TTFT sharply when the append is small.
- It can reduce P-node pressure for multi-turn agentic traces.

Why it can lose in principle:

- D has finite KV capacity, while session lengths keep growing every turn.
- D-local append-prefill is not free; it competes with decode for the same GPU.
- If eviction is naive, you either keep too many stale sessions or evict useful ones and reseed later.
- If admission is too loose, you hit decode retraction.
- If admission is too conservative, you lose the direct-hit advantage and revert to slow PD fallback.

What To Optimize Next

- Use benefit-cost admission, not a binary threshold (a sketch of the split gate with hysteresis follows this list).
  - Go direct only if saved KV transfer + saved P queueing exceeds added decode-prefill cost + added decode-pressure risk.
- Keep separate rules for direct reuse and seed/reseed.
  - This is now coded in src/agentic_pd_hybrid/replay.py, because a seed is much more expensive than direct reuse.
- Add hysteresis.
  - Stop new seeds earlier.
  - Stop direct reuse only near true danger.
  - Resume only after decode load falls well below the stop threshold.
- Make the append threshold dynamic.
  - Low load: allow appends up to 2k.
  - Medium load: maybe 512-1k.
  - High load: 0, force PD.
- Replace plain LRU eviction with reuse-aware eviction.
  - Score by recent access, current prefix size, and expected future turns.
  - Evict low-future-value sessions first.
- Prefer adding D capacity over P capacity for this workload.
  - The late-turn problem here is decode KV pressure, not P throughput.
  - More P helps turn1-heavy traces.
  - More D helps reuse-heavy multi-turn traces like this one.
- If scaling nodes, use kvcache-centric only on sessions whose decode home is not near pressure.
  - Otherwise route them back to PD early, before the context becomes huge.
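A minimal sketch of that policy, assuming a decode-load signal in [0, 1] derived from /v1/loads; the thresholds and class shape are illustrative, not the actual replay.py implementation:

```python
# Hypothetical split-gate admission with hysteresis: seeds stop early,
# direct reuse stops only near real danger, and each gate reopens only
# after load drops well below its stop level. All thresholds illustrative.
SEED_STOP, SEED_RESUME = 0.60, 0.45        # seeds are expensive: back off early
DIRECT_STOP, DIRECT_RESUME = 0.90, 0.75    # direct reuse is cheap: back off late

class SplitGate:
    def __init__(self):
        self.seed_open = True
        self.direct_open = True

    def update(self, decode_load: float) -> None:
        # Hysteresis: an open gate closes at STOP; a closed gate reopens
        # only below RESUME, so the policy does not flap at one threshold.
        self.seed_open = decode_load < (SEED_STOP if self.seed_open else SEED_RESUME)
        self.direct_open = decode_load < (DIRECT_STOP if self.direct_open else DIRECT_RESUME)

    def max_append(self, decode_load: float) -> int:
        # Dynamic append threshold: shrink the direct-path budget under load.
        if decode_load < 0.5:
            return 2048
        if decode_load < 0.8:
            return 512
        return 0                            # force PD fallback

    def admit(self, is_seed: bool, append_tokens: int, decode_load: float) -> bool:
        self.update(decode_load)
        if is_seed:
            return self.seed_open
        return self.direct_open and append_tokens <= self.max_append(decode_load)
```

The point of the split is exactly the auto4/auto5 contrast above: one shared threshold either admits too much (auto4's pool-full events) or rejects too much (auto5's fallback tail).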
Bottom Line

- The best measured performance result right now is still auto4, but it is not safe enough.
- The best measured safe result is auto5, but it is too conservative and slower than baseline PD.
- The correct direction is neither "always direct" nor "always back off".
- The correct policy is "direct reuse aggressively while D is healthy, but stop seeding much earlier than you stop direct reuse".
- That split policy is now implemented in src/agentic_pd_hybrid/replay.py. Its validation run is still in progress.
| Setup | Total VRAM | Model weights | Space left for KVCache (theoretical max) | KV size per token | Storable tokens (theoretical max) | Storable blocks (block size = 16) | Storable blocks (block size = 32) |
| -------------------------------------- | --------: | -------: | --------------------: | --------------------------------: | ----------------: | -------------------------: | -------------------------: |
| **GLM-5.1-FP8 @ 8×H20 141GB** | $1128$ GB | $756$ GB | $372$ GB | $89{,}856$ B $\approx 87.75$ KiB | $\approx 4.14$ M | $\approx 258.7$ K | $\approx 129.4$ K |
| **Qwen3-Coder-480B-FP8 @ 8×H20 141GB** | $1128$ GB | $482$ GB | $646$ GB | $253{,}952$ B $\approx 248.0$ KiB | $\approx 2.54$ M | $\approx 159.0$ K | $\approx 79.5$ K |
| **GLM-5.1-FP8 @ 8×B300 288GB** | $2304$ GB | $756$ GB | $1548$ GB | $89{,}856$ B $\approx 87.75$ KiB | $\approx 17.23$ M | $\approx 1076.7$ K | $\approx 538.4$ K |
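The capacity columns follow directly from the free space and per-token KV size, with GB read as 10^9 bytes (that convention reproduces the table exactly). A quick check:

```python
# Reproduce the table's theoretical KV-capacity numbers.
GB = 10**9  # the table's GB are decimal gigabytes

setups = {
    # name: (bytes left for KV cache, KV bytes per token)
    "GLM-5.1-FP8 @ 8xH20 141GB":          (372 * GB,  89_856),
    "Qwen3-Coder-480B-FP8 @ 8xH20 141GB": (646 * GB, 253_952),
    "GLM-5.1-FP8 @ 8xB300 288GB":        (1548 * GB,  89_856),
}

for name, (kv_bytes, per_token) in setups.items():
    tokens = kv_bytes / per_token
    print(f"{name}: {tokens/1e6:.2f}M tokens, "
          f"{tokens/16/1e3:.1f}K blocks@16, {tokens/32/1e3:.1f}K blocks@32")

# GLM-5.1-FP8 @ 8xH20 141GB: 4.14M tokens, 258.7K blocks@16, 129.4K blocks@32
# Qwen3-Coder-480B-FP8 @ 8xH20 141GB: 2.54M tokens, 159.0K blocks@16, 79.5K blocks@32
# GLM-5.1-FP8 @ 8xB300 288GB: 17.23M tokens, 1076.7K blocks@16, 538.4K blocks@32
```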
Workload plots:

Basics:
1. CDF of input/output length, plus a CDF zoomed into the bottom 80%; annotate mean/p50/p80/p90/p95/p99 under the input and output length figures respectively (see the plotting helper after this list).

Session-related:
2. CDF of session turns, with a zoom into the top 10% of turns; likewise annotate mean/p50/p90/p95/p99 turns under the figure.
3. Line chart of mean/p50/p99 request length versus turn index (grouping all requests with the same turn count into one class for the computation).

Tool-related:
4. Pie chart of the share of requests triggered by each role (tool/user/...), with percentages and counts labeled.
5. CDF of tool-call output length, likewise annotated with mean/p50/p90/p95/p99.
6. CDF of tool-call latency (the interval from the tool-call request returning to the tool-call result triggering the next turn), with a 90% zoom-in, likewise annotated with mean/p50/p90/p95/p99 under the figure.
7. CDF of the number of consecutive tool calls after one user input, likewise annotated with mean/p50/p90/p95/p99 under the figure.
8. CDF of the new context added by a tool-call-triggered next turn in a session (input length relative to the previous turn's output length), likewise with mean/p50/p90/p95/p99.

KVCache-related:
9. CDF of kvcache block reuse time (for the same kvcache block, each reuse's interval since the previous reuse or first creation counts as one sample), with a 90% zoom-in, likewise annotated with mean/p50/p90/p95/p99 under the figure.
10. CDF of kvcache block lifespan (the interval from first appearance to last reuse), likewise annotated with mean/p50/p90/p95/p99.
11. Line chart of the live kvcache block count over time, with one x-axis tick per 10 minutes.

Bucketing-related:
12. Per-bucket kvcache reuse ratio when each bucket's requests may only reuse prefix kvcache within the same bucket, plus the overall unbucketed reuse ratio across all requests, as a bar chart, labeling the total reused blocks per bucket and unbucketed.
13. The kvcache misses caused by a session's consecutive turns crossing buckets, and the lost reused blocks as a share of the corresponding bucket's total kvcache reuse.
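A minimal helper for the CDF items above, using matplotlib/numpy only: a full CDF plus a zoomed panel, with the requested stats line placed under the figure. The function name, quantile choice, and data source are up to the caller:

```python
# Sketch of the CDF-with-zoom plots requested above.
import numpy as np
import matplotlib.pyplot as plt

def plot_cdf(values, title, zoom_q=0.8, percentiles=(50, 80, 90, 95, 99)):
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)

    fig, (ax_full, ax_zoom) = plt.subplots(1, 2, figsize=(10, 4))
    ax_full.plot(x, y)
    ax_full.set_title(title)
    # Zoom panel: only the bottom `zoom_q` fraction of the distribution.
    cut = int(len(x) * zoom_q)
    ax_zoom.plot(x[:cut], y[:cut])
    ax_zoom.set_title(f"{title} (bottom {zoom_q:.0%})")

    stats = " / ".join([f"mean={x.mean():.1f}"] +
                       [f"p{p}={np.percentile(x, p):.1f}" for p in percentiles])
    fig.supxlabel(stats)   # stats line under the figure, as the spec asks
    fig.tight_layout()
    return fig

# Example: plot_cdf(input_lengths, "input length").savefig("input_len_cdf.png")
```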
service_name = 33e6d810.deployment.qwen-max-mainline-chat-* and response_params not null
```json
{
  "kv-transfer-config": {
    "kv_connector": "DecodeBenchConnector",
    "kv_role": "kv_both"
  }
}
```
DS_LLM_DISAGGREGATED_DECODE_NODE
enable_disaggregated_prefilling
kv_transfer_config
DS_LLM_FORCED_MAX_NEW_TOKENS=1
DS_LLM_FORCED_MAX_NEW_TOKENS

DASHGEN_DEPLOYMENT_ROLE=prefill
```
# Prefill role: disaggregated prefilling enabled, HybridConnector as kv_producer.
export DASH_GEN_ARGS=${DASH_GEN_ARGS:-'{"enable_disaggregated_prefilling": 1, "enable_think": 0, "think_mode": "auto", "gpu_memory_utilization": 0.85, "max_model_len": 262144, "enable_chunked_prefill": true, "speculative_config": {"method": "eagle3", "num_speculative_tokens": 1, "hf_overrides": {"rope_scaling": {"type": "yarn", "factor": 128, "original_max_position_embeddings": 2048, "semi_dynamic": false, "dynamic": true}, "num_experts": 0}, "model": "/home/admin/resource/model/464482ce.qwen3-235b-a22b/0717-eagle-0820"}, "swap_space": 0, "enable_prefix_caching": true, "max_num_batched_tokens": 8192, "enable_expert_parallel": false, "disable_hybrid_kv_cache_manager": true, "block_size": 64, "max_num_seqs": 64, "enforce_eager": false, "quantization": "fp8", "cuda_graph_sizes": [16, 32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800, 832, 864, 896, 928, 960, 992, 1024], "compilation_config": {"cudagraph_mode": "PIECEWISE", "use_inductor": false, "custom_ops": ["all"], "max_cudagraph_capture_size": 2048}, "kv_transfer_config": {"kv_connector": "HybridConnector", "kv_role": "kv_producer", "kv_connector_extra_config": {"backend": "kvt"}}, "hf_overrides": {"architectures": ["Qwen3MoeForCausalLM"], "model_type": "qwen3_moe"}, "kv_cache_dtype": "fp8"}'}

# Non-disaggregated baseline: same engine args, disaggregated prefilling off, empty kv_transfer_config.
export DASH_GEN_ARGS=${DASH_GEN_ARGS:-'{"enable_disaggregated_prefilling": false, "enable_think": 0, "think_mode": "auto", "gpu_memory_utilization": 0.85, "max_model_len": 262144, "enable_chunked_prefill": true, "speculative_config": {"method": "eagle3", "num_speculative_tokens": 1, "hf_overrides": {"rope_scaling": {"type": "yarn", "factor": 128, "original_max_position_embeddings": 2048, "semi_dynamic": false, "dynamic": true}, "num_experts": 0}, "model": "/home/admin/resource/model/464482ce.qwen3-235b-a22b/0717-eagle-0820"}, "swap_space": 0, "enable_prefix_caching": true, "max_num_batched_tokens": 8192, "enable_expert_parallel": false, "disable_hybrid_kv_cache_manager": true, "block_size": 64, "max_num_seqs": 64, "enforce_eager": false, "quantization": "fp8", "cuda_graph_sizes": [16, 32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800, 832, 864, 896, 928, 960, 992, 1024], "compilation_config": {"cudagraph_mode": "PIECEWISE", "use_inductor": false, "custom_ops": ["all"], "max_cudagraph_capture_size": 2048}, "kv_transfer_config": "", "hf_overrides": {"architectures": ["Qwen3MoeForCausalLM"], "model_type": "qwen3_moe"}, "kv_cache_dtype": "fp8"}'}
```
## Trace analysis for the coding agent

1. GLM5 040909-040911
2. GLM5 040915-040917
3. Qwen3-Coder 040915-040917
## Ali kvcache status

GLM 5 buckets:
0-32k
32k-85k
85k-128k
128k++

Questions:
- How does the scheduler affect the kvcache hit ratio? Does the 5min TTL cause misjudgments?
- The overhead of transferring kvcache through vineyard
- The value of vineyard's remote cache
Trace analysis:
1. CDF of input/output length
2. CDF of session turns
3. Bar chart of the shares of trigger reasons for session prev-turn -> next-turn transitions
4. CDF of kvcache block reuse time
5. CDF of kvcache lifespan
6. CDF of tool-call latency
7. Distribution of new tokens added per tool call
8. New tokens added per tool call as a share of the current context length
---
dash0: qwen235b prefill-only thinking tuning, step slo246
dash1: qwen27b chat 0-8k, tuned best vs baseline
dash2: qwen235b decode-only thinking tuning with 0323

baseline: 0414, monitoring from the 05:00-07:00 test window
---
```
Turn 1 Input:  1 2 3      In trace: 1 2 3
Turn 1 Output: 4 5 6
Turn 2 Input:  7 8        Append-only context: 1 2 3 4 5 6 7 8
                          Real context:        1 2 4 5 6 7 8
                          Hit: 1 2
Turn 2 Output: 9
Turn 3 Input:  10 11      Append-only context: 1 2 3 4 5 6 7 8 9 10 11
                          Real context:        1 2 4 5 6 7 9 10 11
                          Hit: 1 2 4 5 6 7
```
```
Turn 1 Input:  1 2 3      In trace: 1 2 3
Turn 1 Output: 90 91 92   Output in trace: 4 5 6
Turn 2 Input:  7 8        In trace: 1 2 3 4 5 6 7 8
                          Real context in engine: 1 2 3 90 91 92 7 8
                          Hit: 1 2 3
```
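The two examples above are why append-only trace replay misestimates cache hits: the engine's real context diverges from the trace's recorded context (compaction in the first example, different sampled output in the second), and only the longest common prefix can be reused. A small sketch of computing that hit, using plain ints as tokens:

```python
# Compute the prefix hit between what the replay sends (append-only context
# from the trace) and what the engine actually holds, as in the examples above.
def prefix_hit(cached: list[int], prompt: list[int]) -> list[int]:
    """Longest common prefix: the only part a radix cache can reuse."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return prompt[:n]

# Second example above: the trace recorded output (4 5 6), but the engine
# actually generated (90 91 92), so turn 2 only hits the turn-1 input.
engine_context = [1, 2, 3, 90, 91, 92]
turn2_prompt   = [1, 2, 3, 4, 5, 6, 7, 8]   # append-only context from the trace
print(prefix_hit(engine_context, turn2_prompt))  # -> [1, 2, 3]
```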
projects/agentic-kvcache/outline.md (new, empty)

projects/agentic-kvcache/related-works.md (new):
LAPS: A Length-Aware-Prefill LLM Serving System
https://arxiv.org/abs/2601.11589

Understanding Bottlenecks for Efficiently Serving LLM Inference With KV Offloading
https://arxiv.org/abs/2601.19910

DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference
https://arxiv.org/abs/2602.21548

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
https://www.usenix.org/system/files/osdi24-zhong-yinmin.pdf

Not All Prefills Are Equal: PPD Disaggregation for Multi-turn LLM Serving
https://arxiv.org/pdf/2603.13358

Efficient Multi-round LLM Inference over Disaggregated Serving
https://arxiv.org/pdf/2602.14516v1

Roadmap: SGLang Distributed KVCache System For Agentic Workload
https://github.com/sgl-project/sglang/issues/21846

RFC: P/D Prefill compute optimizations with bi-directional KV cache transfers between P and D nodes
https://github.com/vllm-project/vllm/issues/32733

- [Not All Prefills Are Equal: PPD Disaggregation for Multi-turn LLM Serving](https://arxiv.org/html/2603.13358v1)
- [SWE-AGI: Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents](https://arxiv.org/html/2602.09447v1)
- [DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference](https://arxiv.org/html/2602.21548v1)
- [Prefill/Decode Disaggregation | llm-d](https://llm-d.ai/docs/guide/Installation/pd-disaggregation)
- [CacheSlide: Unlocking Cross Position-Aware KV Cache Reuse for Accelerating LLM Serving | USENIX](https://www.usenix.org/conference/fast26/presentation/liu-yang)
projects/agentic-kvcache/roadmap.md (new, empty)

projects/agentic-kvcache/sync.md (new):
| Bucket | Requests | SLA | Instances | mean TTFT (estimated_ttft) | Infinite-capacity ceiling |
| ------- | ------: | -----: | --: | -----------------------: | -----: |
| 0-32k | 637,142 | <= 5s | 64 | 0.502s | 59.63% |
| 32-85k | 99,735 | <= 10s | 48 | 2.801s | 82.35% |
| 85-128k | 23,624 | <= 15s | 16 | 9.669s | 84.25% |
| 128k+ | 3,226 | <= 20s | 6 | 9.572s | 82.99% |

| Bucket | Best policy | Best TTFT / Hit / Gap | cache_score | cache_score_strong |
| ------- | ------------------------ | ------------------------ | ------------------------- | ------------------------- |
| 0-32k | cache_affinity_weak_rend | 0.488s / 56.11% / 3.52pp | 0.536s / 54.45% / 5.18pp | 0.813s / 56.97% / 2.66pp |
| 32-85k | estimated_ttft | 2.801s / 76.70% / 5.66pp | 3.766s / 77.52% / 4.83pp | 5.193s / 78.00% / 4.35pp |
| 85-128k | cache_affinity_weak_rend | 9.289s / 77.12% / 7.13pp | 9.408s / 77.07% / 7.18pp | 11.906s / 76.87% / 7.38pp |
| 128k+ | estimated_ttft | 9.572s / 74.44% / 8.54pp | 10.630s / 74.56% / 8.42pp | 11.481s / 74.39% / 8.59pp |

cache_score_strong does not win on Qwen3. It gets a slightly higher hit ratio only on 0-32k and 32-85k, and at the cost of worse TTFT; on 85-128k and 128k+ it has no hit-ratio advantage at all and still worse TTFT. In other words, chasing cache more aggressively on Qwen3 does not buy a stable gain.

cache_score is steadier than cache_score_strong. In all four buckets it has better TTFT than cache_score_strong; its hit ratio is close to cache_score_strong's and even better in the long buckets. If choosing only between cache_score and cache_score_strong, prefer cache_score on Qwen3.

No single policy is best across the board. cache_affinity_weak_rend wins on 0-32k and 85-128k, while estimated_ttft wins on 32-85k and 128k+. This means no single policy dominates all length ranges on Qwen3, and differentiated per-bucket routing is worthwhile.

Judging by the gaps, the real problem is not eviction but the workload ceiling itself and the online placement policy. The 0-32k ceiling is so low that no amount of online-routing tuning can move much past roughly 60%; the medium and long buckets have high ceilings, but the current best online policies still trail the infinite-capacity ceiling by 5.7pp to 8.5pp, which means there is routing/placement headroom left, just not via cache_score_strong.