
• I have finished the runs and organized the results. The conclusion is fairly clear: the current P-side priority-eviction implementation brings no appreciable gain. Your expectation is right at the mechanism level, but under the current workload and trigger conditions it never forms an effective surface for gains.

  Micro 2P2D results
  Same trace: 316 requests / 58 sessions / 2P2D / Mooncake / no-stream.

  | group | ok/error | mean | p50 | p90 | p99 | direct-to-D | actual KV blocks | notes |
  |---|---:|---:|---:|---:|---:|---:|---:|---|
  | PD baseline | 316/0 | 22.879s | 25.565s | 29.026s | 30.642s | 0 | 86019 | stable |
  | release, threshold=3 | 316/0 | 22.937s | 25.617s | 29.605s | 32.457s | 0 | 86019 | stable but no direct-to-D gain |
  | priority, threshold=3 | 316/0 | 22.897s | 25.801s | 29.393s | 32.622s | 0 | 86019 | only 6 low-priority P requests |
  | release, threshold=4 | 314/2 | 21.605s | 25.078s | 28.498s | 30.143s | 12 | 82892 | saves KV, but unstable |
  | priority, threshold=4 | 314/2 | 21.743s | 25.286s | 29.094s | 30.561s | 12 | 82892 | priority gives no improvement, slightly slower |

  Priority group paths:
  outputs/live-priority-eviction/...
  outputs/live-priority-eviction-thr4/...

  Ali filtered results
  Generated filtered trace:
  outputs/ali-filtered-smallappend-priority-test.jsonl

  Scale: 81 requests / 28 sessions / 53 turn-2+ requests / max input 18901 / max output 1925 / span 5414s.

  PD baseline completed:

  - ok/error: 81/0
  - mean: 9.072s
  - p50: 7.086s
  - p90: 21.761s
  - p99: 26.813s
  - actual KV blocks: 460

  But kvcache-centric failed to converge twice on Ali filtered:

  - threshold=4: stuck after roughly 58/81 /generate 200 OK responses, GPU at 0%, no metrics
  - threshold=3: stuck after roughly 62/81 /generate 200 OK responses, GPU at 0%, no metrics

  I aborted both groups rather than waiting out the 1200s/3600s timeouts. This says that on Ali filtered the current kvcache-centric direct/session mechanism itself is not yet stable, so the priority-eviction ablation cannot stand.

  Does it match expectations?
  Partially, but it did not translate into a performance gain.

  Your hypothesis was: after P->D, P can preferentially release/evict those sessions' prefix cache, improving P-side prefix reuse. This holds at the mechanism level, but the effect is invisible in the experiments, because:

  1. release-after-transfer already removes the explicit duplication between the P streaming-session backup and the D session copy.
  2. The new priority eviction only changes the eviction order of the P radix/prefix cache, so it matters only when the P radix cache is under real pressure.
  3. At micro threshold=3, direct-to-D is almost disabled for the sake of stability (only 6 turn-1 seeds are low-priority), so the trigger surface is tiny.
  4. At micro threshold=4 there are 12 direct-to-D requests and KV blocks drop from 86019 to 82892, but 2 errors appear, and priority itself neither reduces KV further nor raises the cache-hit rate.
  5. SGLang radix priority propagates max(priority) across shared prefixes; if a low-priority direct session shares a prefix with a high-priority ordinary session, the high priority protects the shared nodes, and the low priority alone cannot get the shared prefix evicted (see the sketch after this list).
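
  A minimal sketch of point 5, assuming a toy two-node shared prefix and two sessions; the node names, priority values, and eviction loop are illustrative, not SGLang's actual data structures:

```python
# Toy model of max(priority) propagation on shared radix-prefix nodes.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    priorities: list = field(default_factory=list)  # one entry per referencing session

    @property
    def effective_priority(self) -> int:
        # Max-propagation: a shared node keeps the highest priority
        # of any session that references it.
        return max(self.priorities)

shared = [Node("system_prompt"), Node("tool_schema")]   # shared prefix path
low = Node("direct_session_suffix", [0])                # low-priority direct-to-D session
high = Node("normal_session_suffix", [10])              # high-priority ordinary session
for n in shared:
    n.priorities += [0, 10]                             # both sessions reference the prefix

# Evict in ascending effective priority: only the low session's private
# suffix is evictable; the shared prefix stays pinned at priority 10.
for n in sorted(shared + [low, high], key=lambda n: n.effective_priority):
    print(n.name, n.effective_priority)
```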

  Next optimization steps
  Priority should be upgraded from merely tagging P radix eviction into a fuller admission/eviction mechanism:

  1. First fix the Ali kvcache-centric non-convergence: give direct-to-D / seed / close_session a shorter per-request timeout plus a failure fallback, so replay cannot hang (a sketch follows this list).
  2. Record real P-side radix evictions / cached tokens / token usage, rather than only request-level cached_tokens.
  3. Change the direct-to-D predictor from "is this a seed?" to "is it worth letting P keep the prefix?": only sessions expected to keep hitting P get high priority.
  4. Do session-aware demotion on the P radix cache; request priority alone is not enough, because shared radix nodes get pinned by high-priority prefixes.
  5. Under multi P/D, keep the P-side low-priority policy consistent with the D sticky/direct policy; otherwise P releases a session's cache and the session is later routed back to P for a large prefill, making things worse.
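
  A sketch of item 1, assuming httpx for the control calls; the endpoint payloads, the 15s budget, and the PD-router fallback address are illustrative assumptions, not the repo's current API:

```python
import asyncio
import httpx

CONTROL_TIMEOUT_S = 15.0                   # far below the 1200s/3600s benchmark timeouts
PD_ROUTER_URL = "http://pd-router:8000"    # assumed address

async def pd_fallback(client: httpx.AsyncClient, payload: dict) -> dict:
    # Fallback path: hand the request back to the normal PD router.
    resp = await client.post(f"{PD_ROUTER_URL}/generate", json=payload)
    resp.raise_for_status()
    return resp.json()

async def direct_to_d(client: httpx.AsyncClient, d_url: str, payload: dict) -> dict:
    try:
        resp = await asyncio.wait_for(
            client.post(f"{d_url}/generate", json=payload),
            timeout=CONTROL_TIMEOUT_S,
        )
        resp.raise_for_status()
        return resp.json()
    except (asyncio.TimeoutError, httpx.HTTPError):
        # A stuck direct/seed/close_session call no longer hangs the replay.
        return await pd_fallback(client, payload)
```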

  Current conclusion: kvcache-centric does reduce P/D streaming-session duplication, but it has not yet been shown to improve P radix/prefix cache reuse. On micro the gains come mainly from a small amount of direct-to-D reducing KV transfer; on Ali filtered the mechanism has not yet cleared the stability bar.


• I have finished a fuller multi P/D sweep, plus analyses of session gaps, of where the mechanism's gains come from, and of which optimizations are feasible. Conclusion first: we currently cannot show that kvcache-centric stably beats PD-disaggregation on the full workload; the real gains come from D-side session/KV reuse, not P-node cache hits. In the larger experiments, scaling D is clearly more important than scaling P.

  Experiment setup
  Uses the fuller synthetic micro trace: outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl

  - 316 requests, 58 sessions, all multi-turn
  - turn counts from 2 to 10, averaging 5.45 turns/session
  - initial input 30k tokens, subsequent appends of 1k tokens, outputs of 256 tokens
  - original duration 1322s; replay uses --time-scale 50, so session gaps shrink roughly 50x at runtime
  - no-stream, concurrency-limit 1000; valid for internal side-by-side comparison, not to be mixed directly with the earlier stream results

  Session gap comparison

  | trace | requests | sessions | multi-turn sessions | inter-turn gap p50 | p90 | p99 |
  |---|---:|---:|---:|---:|---:|---:|
  | full micro (original) | 316 | 58 | 58 | 100.5s | 123.0s | 129.4s |
  | full micro (replay-equivalent) | 316 | 58 | 58 | 2.01s | 2.46s | 2.59s |
  | Ali filtered small | 24 | 7 | 7 | 2.42s | 4.65s | 10.02s |
  | Ali normalized 10min | 142 | 41 | 6 | 4.26s | 8.50s | 379.4s |

  This shows that after replay the new full micro's gaps are already close to Ali filtered small, but it is more "idealized" than Ali normalized: every session is multi-turn, whereas Ali normalized has only 6/41 multi-turn sessions, and its many turn-1 seeds pollute the KV/cache budget. A sketch of the gap computation follows.
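
  A sketch of how the gap percentiles above can be computed from a trace JSONL; the "session_id" and "timestamp" field names are assumptions about the trace schema:

```python
import json
from collections import defaultdict
from statistics import quantiles

def inter_turn_gaps(path: str) -> list:
    per_session = defaultdict(list)
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            per_session[rec["session_id"]].append(rec["timestamp"])
    gaps = []
    for ts in per_session.values():
        ts.sort()
        gaps += [b - a for a, b in zip(ts, ts[1:])]  # gaps between consecutive turns
    return gaps

gaps = inter_turn_gaps("outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl")
qs = quantiles(gaps, n=100)   # qs[k-1] is the k-th percentile
print(f"p50={qs[49]:.2f}s p90={qs[89]:.2f}s p99={qs[98]:.2f}s")
```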

  PD-disaggregation multi P/D results
  All PD runs finished with 0 errors.

  | topology | mean E2E | p50 | p90 | p99 | verdict |
  |---|---:|---:|---:|---:|---|
  | 1P1D | 44.125s | 50.473s | 56.743s | 58.872s | baseline |
  | 2P1D | 42.801s | 49.016s | 54.925s | 57.446s | adding only P: tiny gain, about 3.0% |
  | 1P2D | 29.641s | 29.261s | 45.614s | 53.433s | adding only D: clear gain, about 32.8% |
  | 2P2D | 22.879s | 25.563s | 29.023s | 30.612s | best, about 48.2% |

  This set of numbers is more trustworthy than the 12-request micro. Under this workload the D-node count matters more than the P-node count, because E2E is dominated by decode concurrency, D-side queueing, and D-side session residency.

  kvcache-centric results
  The success-only means sometimes look lower, but there are many errors, so this cannot count as a valid win.

  | admission | topology | errors | successful mean | direct-to-D | main issue |
  |---|---:|---:|---:|---:|---|
  | worker | 1P1D | 0 | 45.333s | 0 | worse than PD 1P1D |
  | worker | 2P1D | 1 | 43.471s | 0 | adding only P is still limited |
  | worker | 1P2D | 33 | 25.109s | 0 | success-only mean is biased; too many errors |
  | worker | 2P2D | 32 | 19.733s | 1 | unstable, cannot count as a win |
  | router | 1P2D | 58 | 23.536s | 0 | even more kvcache-path errors |
  | router | 2P2D | 38 | 20.185s | 0 | still unstable |

  Key observation: the multi-D kvcache-centric success means are low because failed requests are excluded; at the same time direct-to-D = 0/1, which means we never stably exercise the mechanism we actually want to validate ("KV already on D, later turns hit D directly"). Most requests land in pd-router-fallback-large-append-session-cap, i.e. they are bounced back to the PD router by session-cap/admission; a route-label tally like the sketch below makes this easy to check.
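
  A small sketch for tallying route labels from a run's request-metrics.jsonl; the "route" field name and the example path are assumptions about the metrics schema:

```python
import json
from collections import Counter

def route_histogram(path: str) -> Counter:
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts[json.loads(line).get("route", "unknown")] += 1
    return counts

# Illustrative path; point this at any run's request-metrics.jsonl.
for route, n in route_histogram("outputs/live-kvcache-centric-2p2d/request-metrics.jsonl").most_common():
    print(f"{n:5d}  {route}")
```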

  Is more P better?
  The current data does not support the conclusion that P-nodes matter more.

  - Under PD, 1P1D to 2P1D improves only 44.125s -> 42.801s.
  - Under PD, 1P1D to 1P2D improves 44.125s -> 29.641s.
  - kvcache-centric's target path is direct-to-D (bypassing P on a hit), so P-node cache hits are not the main source of gains.
  - In the current implementation P may keep the prefill backup KV, so the earlier kvcache-centric gains mix D reuse, reduced KV transfer, P backup duplication, and fallback/reseed; they cannot be read as pure P cache hits.

  For xP + yD:

  - With yD fixed, adding only P helps only when prefill is clearly congested; that is not the main bottleneck this time.
  - With xP fixed, adding D directly relieves decode concurrency, D session residency, and the queueing behind sticky-D hits; this is the main gain.
  - With more D, kvcache-centric is in theory more valuable, but only if admission is correct, session-cap does not over-reject, and the direct path is stable. None of that holds yet.

  Optimization conclusions
  I have added ablation switches for seed admission, touching src/agentic_pd_hybrid/replay.py, src/agentic_pd_hybrid/benchmark.py, and src/agentic_pd_hybrid/cli.py, and filled in src/agentic_pd_hybrid/profile.py. uv run python -m compileall src passes.

  Feasible optimizations:

  1. multi-turn-aware seed admission
     Ali normalized oracle ablation: PD mean 19.643s, old kvcache 21.390s, oracle multi-turn seed 16.417s. The direction works, but production cannot use an oracle; we need to predict from trace/session features which sessions will be multi-turn (a predictor sketch follows this list).
  2. Redo D session-cap/admission
     In the current full micro, kvcache direct almost never triggers, mainly because of session-cap/admission. "Whether to seed/reseed" and "whether to go direct-to-D" need separate admission decisions, plus eviction driven by resident tokens, last access, and expected future turns.
  3. Make the P backup policy explicit
     Needs separate experiments:
      - P releases KV immediately after a successful seed
      - keep a P backup only for hot sessions
      - use the P backup only for fallback/reseed
     Otherwise we cannot answer how much P cache hits actually contribute.
  4. Multi-D kvcache path stability
     Both worker and router admission hit 500/connection-type errors under accelerated multi-D. This must be fixed first; otherwise the success-only means carry no weight.
  5. Topology policy
     Under the current 8-GPU constraint, prioritize the D count. For the agentic coding workload, the next configurations worth testing are 1P3D, 2P3D, and 2P4D, to see when D scaling saturates; P is kept only as the minimum resource that keeps prefill/transfer unclogged.
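
  A sketch of the non-oracle predictor from item 1; the features and thresholds are illustrative assumptions to be tuned per trace, not fitted values:

```python
def likely_multi_turn(first_input_tokens: int,
                      has_tool_role: bool,
                      client_tag: str = "") -> bool:
    # Agentic coding sessions tend to open with a big context and tools
    # enabled; one-shot chat seeds usually look smaller.
    if has_tool_role:
        return True
    if first_input_tokens >= 8192:            # assumed cutoff
        return True
    return client_tag in {"agent", "coding"}  # assumed client metadata

# Seed (pin KV on D) only for sessions likely to come back:
# admit_seed(s) if likely_multi_turn(...) else route_plain_pd(s)
```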

  Final judgment: kvcache-centric's gains should come from "less KV transfer + D-side session reuse + avoided re-prefill", not from P-node cache hits. The current implementation does not yet deliver this mechanism stably; next, fix admission/session-cap/direct path first, then do the P backup ablation.


• I reran the 10-minute normalized benchmark on the sampled Ali shard at outputs/ali/normalized/ali-10min-sampled-10k-append1k-out1k.jsonl. The two completed summaries are
  outputs/ali/live-kvcache-centric-ali-10min-norm-10k1k1k/kvcache-centric-sticky-20260424T042535Z/request-metrics.jsonl.summary.json and outputs/ali/live-pd-disaggregation-ali-
  10min-norm-10k1k1k/pd-disaggregation-sticky-20260424T044951Z/request-metrics.jsonl.summary.json.

  On this workload, pd-disaggregation is better than the current router-managed kvcache-centric:

  - Mean E2E: 19.64s vs 21.39s for kvcache-centric
  - Mean TTFT: 1.15s vs 2.69s
  - Mean TPOT: 0.01849s vs 0.01870s
  - P90 E2E: 34.80s vs 35.30s
  - P99 E2E is basically tied: 35.56s vs 35.53s

  The important part is why. The direct-to-D path itself is good, but it is not used enough:

  - kvcache-direct-to-d-session happened on only 43 requests
  - Those 43 requests were actually better than pd-disaggregation on the same request IDs:
      - mean E2E delta: -0.374s
      - mean TTFT delta: -0.477s
  - But 55 requests took pd-router-fallback-large-append, and those were much worse than pd-disaggregation:
      - mean E2E delta: +4.445s
      - mean TTFT delta: +4.212s

  So the current loss is not “direct-to-D is bad”. It is “too few requests get the good path, and too many fall back expensively”.
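
  A sketch of the matched-pair deltas above: join the two runs' request-metrics.jsonl on request ID and average per-request differences for one route label. The field names ("request_id", "route", "e2e_s", "ttft_s") are assumptions about the metrics schema:

```python
import json

def load_metrics(path: str) -> dict:
    with open(path) as f:
        return {r["request_id"]: r for r in map(json.loads, f)}

def mean_deltas(kv_path: str, pd_path: str, route: str):
    kv, pd = load_metrics(kv_path), load_metrics(pd_path)
    ids = [i for i, r in kv.items() if r.get("route") == route and i in pd]
    d_e2e = sum(kv[i]["e2e_s"] - pd[i]["e2e_s"] for i in ids) / len(ids)
    d_ttft = sum(kv[i]["ttft_s"] - pd[i]["ttft_s"] for i in ids) / len(ids)
    return len(ids), d_e2e, d_ttft

n, de, dt = mean_deltas("kv-run/request-metrics.jsonl",   # illustrative paths
                        "pd-run/request-metrics.jsonl",
                        route="kvcache-direct-to-d-session")
print(f"{n} matched requests: E2E delta {de:+.3f}s, TTFT delta {dt:+.3f}s")
```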

  A few concrete observations:

  - This shard has 142 requests, 41 sessions, and only 6 multi-turn sessions.
  - Only 2 of those 6 multi-turn sessions actually stayed on the direct-to-D path.
  - pd-disaggregation still used prefix cache on the P-node heavily:
      - cache_hit_request_count=139
      - mean cached tokens 24924.5
  - kvcache-centric reduced actual KV transfer from 25513 blocks to 21930 blocks, saving 3583 blocks, but that was not enough to offset the worse TTFT.

  The most actionable diagnosis is that the router's direct-to-D decision is still too conservative, or too misaligned with actual cache state. In the bad sessions, the run labeled many requests as fallback-large-append even when the P-side observed cache hit covered almost the whole prompt. That means the router-side estimate of "the uncached suffix is small enough" does not match what the workers can really reuse.

  What makes kvcache-centric better:

  - The decode node already has the session KV resident.
  - The uncached suffix is truly small.
  - The request can bypass P entirely.
  - Decode-side queueing stays controlled.

  What makes it worse:

  - The direct-hit rate is low.
  - Requests fall back after router-side misclassification.
  - Decode node takes extra append-prefill pressure without enough saved transfer/prefill cost.
  - pd-disaggregation already gets strong P-side prefix reuse, so the gap to beat is smaller than “full re-prefill”.

  The next optimizations I would do are:

  - Make direct-to-D eligibility depend on the worker-observed reusable prefix, not only the router's session-length heuristic (sketched below).
  - Preserve decode residency/stickiness better once a session is seeded on D.
  - Avoid classifying requests as “large append” when the actual uncached suffix is still small after radix reuse.
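
  A sketch of the first bullet's eligibility rule; worker_prefix_hit_tokens would come from a worker-side probe, and both thresholds are assumptions:

```python
MAX_UNCACHED_SUFFIX = 2048   # assumed budget for D-local append-prefill

def direct_to_d_eligible(prompt_tokens: int,
                         worker_prefix_hit_tokens: int,
                         d_free_kv_tokens: int) -> bool:
    # Size the decision on what the worker can actually reuse, not on a
    # router-side session-length guess.
    uncached_suffix = prompt_tokens - worker_prefix_hit_tokens
    return (uncached_suffix <= MAX_UNCACHED_SUFFIX
            and d_free_kv_tokens >= 2 * uncached_suffix)  # slack for decode output
```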

  If you want, I can move directly to patching the routing criterion using worker-observed prefix-hit information and rerun this same normalized benchmark.


• I fixed the mechanism in two places.

  • src/agentic_pd_hybrid/replay.py now uses live decode load from /v1/loads instead of only static max_total_num_tokens, and it distinguishes direct reuse from fresh seed/reseed.
  • third_party/sglang/python/sglang/srt/disaggregation/decode.py no longer crashes on multiple reused-session requests in one decode-local extend batch.
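
A sketch of the live-load gate, assuming /v1/loads (named above) returns current and maximum KV token counts; the response field names are assumptions, not a documented schema:

```python
import httpx

async def decode_kv_utilization(d_url: str) -> float:
    async with httpx.AsyncClient(timeout=2.0) as client:
        resp = await client.get(f"{d_url}/v1/loads")
        resp.raise_for_status()
        load = resp.json()
    # Assumed fields: live KV usage vs the static pool size.
    return load["num_used_tokens"] / load["max_total_num_tokens"]

# Gate example with an assumed 85% threshold:
# if await decode_kv_utilization(d_url) < 0.85: go direct, else fall back to PD
```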

Measured Results

Baseline pd-disaggregation:

  • Mean TTFT 9.32s
  • Mean TPOT 26.11 ms/token
  • Mean E2E 35.44s
  • Actual KV transfer 20004 blocks

Best measured kvcache-centric so far is auto4:

  • Summary: outputs/micro/live-kvcache-centric-smallappend-12s-16t-20gap-rdma-auto4/kvcache-centric-sticky-20260423T125640Z/request-metrics.jsonl.summary.json
  • Mean TTFT 5.44s
  • Mean TPOT 25.45 ms/token
  • Mean E2E 31.83s
  • Actual KV transfer 6670 blocks
  • Direct-to-D requests 160/180 reuse turns
  • This is about 10.2% better E2E than pd-disaggregation
  • But decode still hit "KV cache pool is full" twice, so it was faster but not at a safe operating point

The measured conservative live-load gate is auto5:

  • Summary: outputs/micro/live-kvcache-centric-smallappend-12s-16t-20gap-rdma-auto5/kvcache-centric-sticky-20260423T162904Z/request-metrics.jsonl.summary.json
  • Mean TTFT 23.36s
  • Mean TPOT 19.05 ms/token
  • Mean E2E 42.41s
  • Actual KV transfer 15589 blocks
  • Direct-to-D requests only 53/180
  • Stable, but worse than pd-disaggregation

A tuned split-gate version is now implemented in src/agentic_pd_hybrid/replay.py. Its validation run auto6 is still draining, so I am not claiming results for it yet.
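
A sketch of that split gate with hysteresis; all three thresholds are illustrative assumptions, not the tuned values in replay.py:

```python
SEED_STOP = 0.60     # stop admitting new seeds early
DIRECT_STOP = 0.90   # stop direct reuse only near true danger
RESUME = 0.50        # resume only well below the stop thresholds

class SplitGate:
    """Seeds and direct reuse are gated separately on decode KV load."""
    def __init__(self) -> None:
        self.seeding = True
        self.direct = True

    def update(self, kv_utilization: float) -> None:
        if kv_utilization >= SEED_STOP:
            self.seeding = False
        if kv_utilization >= DIRECT_STOP:
            self.direct = False
        if kv_utilization <= RESUME:
            self.seeding = self.direct = True

gate = SplitGate()
gate.update(0.70)
print(gate.seeding, gate.direct)   # False True: seeds paused, reuse still on
```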

Serious Analysis

kvcache-centric is better than pd-disaggregation when:

  • Turn2+ has a very small append and a very large reused prefix.
  • The decode node already has the session KV.
  • Decode has enough free KV blocks and is not already carrying a transfer/prealloc backlog.
  • In that regime, it avoids both P-side append-prefill and P->D KV transfer.
  • That is exactly why auto4 won: 160 turns went direct, KV transfer dropped by 66.7%, and turn2-4 TTFT stayed around 0.19s.

kvcache-centric becomes worse than pd-disaggregation when:

  • The decode node is the bottleneck.
  • Direct append-prefill consumes decode KV and decode compute at the same time.
  • Some requests still go direct while others fall back to PD.
  • That mixed regime is the worst case: direct requests occupy decode, then fallback requests pay full PD transfer cost on top of an already busy decode.
  • That is what auto5 shows. It avoided pool-full events, but it rejected too many direct hits. Late turns then became mostly PD fallback with very large cached contexts, so TTFT exploded:
    • turn9-12 mean TTFT 33.43s
    • turn13-16 mean TTFT 53.03s
  • In auto5, the bad tail is not direct requests. It is fallback traffic:
    • pd-router-fallback-d-backpressure mean TTFT 39.18s
    • pd-router-fallback-no-d-capacity mean TTFT 32.60s
  • So the real problem is not “direct-to-D is slow”. The problem is “once D gets too full, fallback happens too late, and fallback requests are then huge”.

Why kvcache-centric can beat PD in principle:

  • It removes the extra RDMA KV hop on reuse turns.
  • It avoids re-prefilling the whole reused prefix on P.
  • It reduces TTFT sharply when append is small.
  • It can reduce P-node pressure for multi-turn agentic traces.

Why it can lose in principle:

  • D has finite KV capacity, while session lengths keep growing every turn.
  • D-local append-prefill is not free; it competes with decode for the same GPU.
  • If eviction is naive, you either keep too many stale sessions or evict useful ones and reseed later.
  • If admission is too loose, you hit decode retraction.
  • If admission is too conservative, you lose the direct-hit advantage and revert to slow PD fallback.

What To Optimize Next

  • Use benefit-cost admission, not a binary threshold.
    • Direct only if saved KV transfer + saved P queueing is larger than added decode-prefill cost + added decode-pressure risk.
  • Keep separate rules for direct reuse and seed/reseed.
    • This is now coded in src/agentic_pd_hybrid/replay.py, because seed is much more expensive than direct reuse.
  • Add hysteresis.
    • Stop new seeds earlier.
    • Stop direct reuse only near true danger.
    • Resume only after decode load falls well below the stop threshold.
  • Make the append threshold dynamic.
    • Low load: allow append up to 2k.
    • Medium load: maybe 512-1k.
    • High load: 0, force PD.
  • Replace plain LRU eviction with reuse-aware eviction (see the sketch after this list).
    • Score by recent access, current prefix size, and expected future turns.
    • Evict low-future-value sessions first.
  • Prefer adding D capacity over P capacity for this workload.
    • The late-turn problem here is decode KV pressure, not P throughput.
    • More P helps turn1-heavy traces.
    • More D helps reuse-heavy multi-turn traces like this one.
  • If scaling nodes, use kvcache-centric only on sessions whose decode home is not near pressure.
    • Otherwise route them back to PD early, before the context becomes huge.
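
A sketch of the reuse-aware eviction score from the list above; the weights and the expected-turns estimate (the trace's 5.45 mean turns/session) are illustrative assumptions:

```python
import time
from dataclasses import dataclass

@dataclass
class ResidentSession:
    session_id: str
    last_access: float    # unix seconds
    prefix_tokens: int    # KV currently resident on D
    turns_so_far: int

def future_value(s: ResidentSession, now: float, mean_turns: float = 5.45) -> float:
    recency = 1.0 / (1.0 + (now - s.last_access))          # decays while idle
    expected_future_turns = max(mean_turns - s.turns_so_far, 0.0)
    # Big prefixes are expensive to reseed, so keeping them is worth more.
    return recency * (1.0 + expected_future_turns) * s.prefix_tokens

def pick_victim(sessions: list) -> ResidentSession:
    # Evict the session with the lowest expected future value, not the LRU one.
    now = time.time()
    return min(sessions, key=lambda s: future_value(s, now))
```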

Bottom Line

  • Best measured performance result right now is still auto4, but it is not safe enough.
  • Best measured safe result is auto5, but it is too conservative and slower than baseline PD.
  • The correct direction is not “always direct” and not “always back off”.
  • The correct policy is “direct reuse aggressively when D is healthy, but stop seeding much earlier than you stop direct reuse”.
  • That split policy is now implemented in src/agentic_pd_hybrid/replay.py. The validation run for it is still in progress.
| Setup | Total VRAM | Model weights | KVCache space (theoretical cap) | Per-token KV size | Tokens storable (theoretical cap) | Blocks (block size = 16) | Blocks (block size = 32) |
|---|---:|---:|---:|---|---:|---:|---:|
| GLM-5.1-FP8 @ 8×H20 141GB | 1128 GB | 756 GB | 372 GB | 89,856 B ≈ 87.75 KiB | ≈ 4.14 M | ≈ 258.7 K | ≈ 129.4 K |
| Qwen3-Coder-480B-FP8 @ 8×H20 141GB | 1128 GB | 482 GB | 646 GB | 253,952 B ≈ 248.0 KiB | ≈ 2.54 M | ≈ 159.0 K | ≈ 79.5 K |
| GLM-5.1-FP8 @ 8×B300 288GB | 2304 GB | 756 GB | 1548 GB | 89,856 B ≈ 87.75 KiB | ≈ 17.23 M | ≈ 1076.7 K | ≈ 538.4 K |
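
The table's arithmetic, reproduced as a small sketch (decimal GB, i.e. 1e9 bytes, matches the rounding above):

```python
def kv_capacity(total_gb: float, weights_gb: float,
                per_token_bytes: int, block_size: int = 16):
    kv_bytes = (total_gb - weights_gb) * 1e9      # space left for KVCache
    tokens = kv_bytes / per_token_bytes
    return tokens, tokens / block_size

for name, total, weights, per_tok in [
    ("GLM-5.1-FP8 @ 8xH20", 1128, 756, 89_856),
    ("Qwen3-Coder-480B-FP8 @ 8xH20", 1128, 482, 253_952),
    ("GLM-5.1-FP8 @ 8xB300", 2304, 756, 89_856),
]:
    tokens, blocks16 = kv_capacity(total, weights, per_tok)
    print(f"{name}: {tokens/1e6:.2f}M tokens, {blocks16/1e3:.1f}K blocks @ block size 16")
```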

Workload plots:

Basic info:

  1. CDF of input/output length, plus a CDF zoomed into the bottom 80%; under each input/output length figure annotate mean/p50/p80/p90/p95/p99.

Session-related:

  2. CDF of session turns, with a zoom into the first 10% of turns; likewise annotate mean/p50/p90/p95/p99 turns under the figure.
  3. Line chart of mean/p50/p99 request length versus turn index (group all requests with the same turn index into one class for the computation).

Tool-related:

  4. Pie chart of the share of requests triggered by each role (tool/user/…), labeled with percentages and counts.
  5. CDF of tool-call output length, likewise annotated with mean/p50/p90/p95/p99.
  6. CDF of tool-call latency (the interval from a tool-call request returning to the tool-call result coming back and triggering the next turn), with a 90% zoom-in; likewise annotate mean/p50/p90/p95/p99 under the figure.
  7. CDF of the number of consecutive tool calls after a single user input, likewise annotated with mean/p50/p90/p95/p99 under the figure.
  8. CDF of the added context when a tool call triggers the next turn in a session (this turn's input length versus the previous turn's output length), likewise with mean/p50/p90/p95/p99.

KVCache-related:

  9. CDF of kvcache block reuse time (for the same kvcache block, the interval between each reuse and the previous reuse / its first creation counts as one sample), with a 90% zoom-in; likewise annotate mean/p50/p90/p95/p99 under the figure.
  10. CDF of kvcache block lifespan (the interval from first appearance to last reuse), likewise annotated with mean/p50/p90/p95/p99.
  11. Line chart of live kvcache block count over time, with one x-axis tick per 10 minutes.

Bucketing-related:

  12. Assuming each bucket's requests are independent and can only reuse prefix kvcache within the same bucket: the kvcache reuse ratio inside each bucket, plus the overall unbucketed reuse ratio over all requests; bar chart, also labeling the total reused blocks per bucket / unbucketed.
  13. The kvcache misses caused when consecutive turns of a session cross bucket boundaries, and the lost reused kvcache blocks as a fraction of the corresponding bucket's total kvcache reuse.

An annotated-CDF helper covering most of these is sketched below.
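
A helper sketch for the annotated CDFs requested above, using matplotlib/numpy; the zoom is expressed as a quantile cutoff:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_cdf(values, title, zoom_q=None, ax=None):
    ax = ax or plt.gca()
    xs = np.sort(np.asarray(values, dtype=float))
    ys = np.arange(1, len(xs) + 1) / len(xs)
    ax.plot(xs, ys)
    if zoom_q is not None:                       # e.g. 0.8 for an 80% zoom-in
        ax.set_xlim(xs[0], np.quantile(xs, zoom_q))
    stats = {"mean": xs.mean(),
             **{f"p{q}": np.percentile(xs, q) for q in (50, 80, 90, 95, 99)}}
    ax.set_title(title)
    ax.set_xlabel("  ".join(f"{k}={v:.1f}" for k, v in stats.items()))  # stats under the plot
    ax.set_ylabel("CDF")
    return stats

# Usage: plot_cdf(input_lengths, "input length CDF", zoom_q=0.8); plt.show()
```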

service_name = 33e6d810.deployment.qwen-max-mainline-chat-* and response_params not null

```json
{
  "kv-transfer-config": {
    "kv_connector": "DecodeBenchConnector",
    "kv_role": "kv_both"
  }
}
```

Related knobs: DS_LLM_DISAGGREGATED_DECODE_NODE, enable_disaggregated_prefilling, kv_transfer_config, DS_LLM_FORCED_MAX_NEW_TOKENS (=1)

DASHGEN_DEPLOYMENT_ROLE=prefill

```bash
export DASH_GEN_ARGS=${DASH_GEN_ARGS:-'{"enable_disaggregated_prefilling": 1, "enable_think": 0, "think_mode": "auto", "gpu_memory_utilization": 0.85, "max_model_len": 262144, "enable_chunked_prefill": true, "speculative_config": {"method": "eagle3", "num_speculative_tokens": 1, "hf_overrides": {"rope_scaling": {"type": "yarn", "factor": 128, "original_max_position_embeddings": 2048, "semi_dynamic": false, "dynamic": true}, "num_experts": 0}, "model": "/home/admin/resource/model/464482ce.qwen3-235b-a22b/0717-eagle-0820"}, "swap_space": 0, "enable_prefix_caching": true, "max_num_batched_tokens": 8192, "enable_expert_parallel": false, "disable_hybrid_kv_cache_manager": true, "block_size": 64, "max_num_seqs": 64, "enforce_eager": false, "quantization": "fp8", "cuda_graph_sizes": [16, 32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800, 832, 864, 896, 928, 960, 992, 1024], "compilation_config": {"cudagraph_mode": "PIECEWISE", "use_inductor": false, "custom_ops": ["all"], "max_cudagraph_capture_size": 2048}, "kv_transfer_config": {"kv_connector": "HybridConnector", "kv_role": "kv_producer", "kv_connector_extra_config": {"backend": "kvt"}}, "hf_overrides": {"architectures": ["Qwen3MoeForCausalLM"], "model_type": "qwen3_moe"}, "kv_cache_dtype": "fp8"}'}

export DASH_GEN_ARGS=${DASH_GEN_ARGS:-'{"enable_disaggregated_prefilling": false, "enable_think": 0, "think_mode": "auto", "gpu_memory_utilization": 0.85, "max_model_len": 262144, "enable_chunked_prefill": true, "speculative_config": {"method": "eagle3", "num_speculative_tokens": 1, "hf_overrides": {"rope_scaling": {"type": "yarn", "factor": 128, "original_max_position_embeddings": 2048, "semi_dynamic": false, "dynamic": true}, "num_experts": 0}, "model": "/home/admin/resource/model/464482ce.qwen3-235b-a22b/0717-eagle-0820"}, "swap_space": 0, "enable_prefix_caching": true, "max_num_batched_tokens": 8192, "enable_expert_parallel": false, "disable_hybrid_kv_cache_manager": true, "block_size": 64, "max_num_seqs": 64, "enforce_eager": false, "quantization": "fp8", "cuda_graph_sizes": [16, 32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800, 832, 864, 896, 928, 960, 992, 1024], "compilation_config": {"cudagraph_mode": "PIECEWISE", "use_inductor": false, "custom_ops": ["all"], "max_cudagraph_capture_size": 2048}, "kv_transfer_config": "", "hf_overrides": {"architectures": ["Qwen3MoeForCausalLM"], "model_type": "qwen3_moe"}, "kv_cache_dtype": "fp8"}'}
```

Trace analysis for coding agents

  1. GLM5 040909-040911
  2. GLM5 040915-040917
  3. Qwen3-Coder 040915-040917

Current state of Ali kvcache

GLM 5 buckets: 0-32k, 32k-85k, 85k-128k, 128k+

Questions:

  • How does the scheduler affect the kvcache hit rate? Does the 5min TTL cause misjudgments?
  • The overhead of transferring kvcache through vineyard
  • The value of the vineyard remote cache

Trace analysis:

  1. CDF of input/output length
  2. CDF of session turns
  3. Bar chart of the share of trigger causes for session prev-turn -> next-turn transitions
  4. CDF of kvcache block reuse time
  5. CDF of kvcache lifespan
  6. CDF of tool-call latency
  7. Distribution of new tokens added by tool calls
  8. New tokens added by tool calls as a fraction of the current context length

Tuning runs:

  - dash0: qwen235b prefill-only thinking tuning, step slo246
  - dash1: qwen27b chat 0-8k, tuned best vs baseline
  - dash2: qwen235b decode-only thinking tuning with 0323

baseline: 0414, monitoring of the 05:00-07:00 test window


Example 1: the agent compacts history, so the engine's real context diverges from the append-only trace context, and the prefix hit shrinks to the still-shared prefix.

```
Turn 1 Input:  1 2 3          In trace: 1 2 3
Turn 1 Output: 4 5 6
Turn 2 Input:  7 8            Append-only context: 1 2 3 4 5 6 7 8
                              Real context:        1 2 4 5 6 7 8
                              Hit:                 1 2
Turn 2 Output: 9
Turn 3 Input:  10 11          Append-only context: 1 2 3 4 5 6 7 8 9 10 11
                              Real context:        1 2 4 5 6 7 9 10 11
                              Hit:                 1 2 4 5 6 7
```

Example 2: the engine's actual turn-1 output (90 91 92) differs from the trace's recorded output (4 5 6), so only the turn-1 input still matches between the trace context and the engine's real cache.

```
Turn 1 Input:  1 2 3          In trace: 1 2 3
Turn 1 Output: 90 91 92       Output in trace: 4 5 6
Turn 2 Input:  7 8            In trace: 1 2 3 4 5 6 7 8
                              Real context in engine: 1 2 3 90 91 92 7 8
                              Hit:                    1 2 3
```
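
A tiny sketch of the hit computation in example 1 (pure illustration; token IDs stand in for KV blocks):

```python
def prefix_hit(cached: list, request: list) -> list:
    # Prefix reuse = longest common prefix of the cached context and the
    # next request's real context.
    hit = []
    for a, b in zip(cached, request):
        if a != b:
            break
        hit.append(a)
    return hit

# Turn 1 leaves input 1 2 3 + output 4 5 6 in cache; turn 2 drops token 3.
print(prefix_hit([1, 2, 3, 4, 5, 6], [1, 2, 4, 5, 6, 7, 8]))
# -> [1, 2]
# Turn 2 leaves 1 2 4 5 6 7 8 + output 9; turn 3 replaces 8 with 9.
print(prefix_hit([1, 2, 4, 5, 6, 7, 8, 9], [1, 2, 4, 5, 6, 7, 9, 10, 11]))
# -> [1, 2, 4, 5, 6, 7]
```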