
• I have finished the runs and organized the results. The conclusion is fairly clear: the current P-side priority-eviction implementation brings no appreciable gain. Your expectation is right at the mechanism level, but under the current workload and trigger conditions it never forms an effective surface for gains.

  Micro 2P2D results
  Same trace: 316 requests / 58 sessions / 2P2D / Mooncake / no-stream.

  | group | ok/error | mean | p50 | p90 | p99 | direct-to-D | actual KV blocks | notes |
  |---|---:|---:|---:|---:|---:|---:|---:|---|
  | PD baseline | 316/0 | 22.879s | 25.565s | 29.026s | 30.642s | 0 | 86019 | stable |
  | release, threshold=3 | 316/0 | 22.937s | 25.617s | 29.605s | 32.457s | 0 | 86019 | stable but no direct-to-D gain |
  | priority, threshold=3 | 316/0 | 22.897s | 25.801s | 29.393s | 32.622s | 0 | 86019 | only 6 low-priority P requests |
  | release, threshold=4 | 314/2 | 21.605s | 25.078s | 28.498s | 30.143s | 12 | 82892 | saves KV, but unstable |
  | priority, threshold=4 | 314/2 | 21.743s | 25.286s | 29.094s | 30.561s | 12 | 82892 | priority gives no improvement, slightly slower |

  Priority group paths:
  outputs/live-priority-eviction/...
  outputs/live-priority-eviction-thr4/...

  Ali filtered results
  Generated filtered trace:
  outputs/ali-filtered-smallappend-priority-test.jsonl

  Scale: 81 requests / 28 sessions / 53 turn-2+ requests / max input 18901 / max output 1925 / span 5414s.

  PD baseline completed:

  - ok/error: 81/0
  - mean: 9.072s
  - p50: 7.086s
  - p90: 21.761s
  - p99: 26.813s
  - actual KV blocks: 460

  But kvcache-centric failed to converge twice on Ali filtered:

  - threshold=4: stuck after roughly 58/81 /generate 200 OK responses, GPU at 0%, no metrics
  - threshold=3: stuck after roughly 62/81 /generate 200 OK responses, GPU at 0%, no metrics

  I aborted both groups rather than waiting out the 1200s/3600s timeouts. This says that on Ali filtered the current kvcache-centric direct/session mechanism itself is not yet stable, so the priority-eviction ablation cannot stand.

  Does it match expectations?
  Partially, but it did not translate into a performance gain.

  Your hypothesis was: after P->D, P can preferentially release/evict those sessions' prefix cache, improving P-side prefix reuse. This holds at the mechanism level, but the effect is invisible in the experiments, because:

  1. release-after-transfer already removes the explicit duplication between the P streaming-session backup and the D session copy.
  2. The new priority eviction only changes the eviction order of the P radix/prefix cache, so it matters only when the P radix cache is under real pressure.
  3. At micro threshold=3, direct-to-D is almost disabled for the sake of stability (only 6 turn-1 seeds are low-priority), so the trigger surface is tiny.
  4. At micro threshold=4 there are 12 direct-to-D requests and KV blocks drop from 86019 to 82892, but 2 errors appear, and priority itself neither reduces KV further nor raises the cache-hit rate.
  5. SGLang radix priority propagates max(priority) across shared prefixes; if a low-priority direct session shares a prefix with a high-priority ordinary session, the high priority protects the shared nodes, and the low priority alone cannot get the shared prefix evicted (see the sketch after this list).
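
  A minimal sketch of point 5, assuming a toy two-node shared prefix and two sessions; the node names, priority values, and eviction loop are illustrative, not SGLang's actual data structures:

```python
# Toy model of max(priority) propagation on shared radix-prefix nodes.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    priorities: list = field(default_factory=list)  # one entry per referencing session

    @property
    def effective_priority(self) -> int:
        # Max-propagation: a shared node keeps the highest priority
        # of any session that references it.
        return max(self.priorities)

shared = [Node("system_prompt"), Node("tool_schema")]   # shared prefix path
low = Node("direct_session_suffix", [0])                # low-priority direct-to-D session
high = Node("normal_session_suffix", [10])              # high-priority ordinary session
for n in shared:
    n.priorities += [0, 10]                             # both sessions reference the prefix

# Evict in ascending effective priority: only the low session's private
# suffix is evictable; the shared prefix stays pinned at priority 10.
for n in sorted(shared + [low, high], key=lambda n: n.effective_priority):
    print(n.name, n.effective_priority)
```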

  Next optimization steps
  Priority should be upgraded from merely tagging P radix eviction into a fuller admission/eviction mechanism:

  1. First fix the Ali kvcache-centric non-convergence: give direct-to-D / seed / close_session a shorter per-request timeout plus a failure fallback, so replay cannot hang (a sketch follows this list).
  2. Record real P-side radix evictions / cached tokens / token usage, rather than only request-level cached_tokens.
  3. Change the direct-to-D predictor from "is this a seed?" to "is it worth letting P keep the prefix?": only sessions expected to keep hitting P get high priority.
  4. Do session-aware demotion on the P radix cache; request priority alone is not enough, because shared radix nodes get pinned by high-priority prefixes.
  5. Under multi P/D, keep the P-side low-priority policy consistent with the D sticky/direct policy; otherwise P releases a session's cache and the session is later routed back to P for a large prefill, making things worse.
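
  A sketch of item 1, assuming httpx for the control calls; the endpoint payloads, the 15s budget, and the PD-router fallback address are illustrative assumptions, not the repo's current API:

```python
import asyncio
import httpx

CONTROL_TIMEOUT_S = 15.0                   # far below the 1200s/3600s benchmark timeouts
PD_ROUTER_URL = "http://pd-router:8000"    # assumed address

async def pd_fallback(client: httpx.AsyncClient, payload: dict) -> dict:
    # Fallback path: hand the request back to the normal PD router.
    resp = await client.post(f"{PD_ROUTER_URL}/generate", json=payload)
    resp.raise_for_status()
    return resp.json()

async def direct_to_d(client: httpx.AsyncClient, d_url: str, payload: dict) -> dict:
    try:
        resp = await asyncio.wait_for(
            client.post(f"{d_url}/generate", json=payload),
            timeout=CONTROL_TIMEOUT_S,
        )
        resp.raise_for_status()
        return resp.json()
    except (asyncio.TimeoutError, httpx.HTTPError):
        # A stuck direct/seed/close_session call no longer hangs the replay.
        return await pd_fallback(client, payload)
```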

  Current conclusion: kvcache-centric does reduce P/D streaming-session duplication, but it has not yet been shown to improve P radix/prefix cache reuse. On micro the gains come mainly from a small amount of direct-to-D reducing KV transfer; on Ali filtered the mechanism has not yet cleared the stability bar.


• I have finished a fuller multi P/D sweep, plus analyses of session gaps, of where the mechanism's gains come from, and of which optimizations are feasible. Conclusion first: we currently cannot show that kvcache-centric stably beats PD-disaggregation on the full workload; the real gains come from D-side session/KV reuse, not P-node cache hits. In the larger experiments, scaling D is clearly more important than scaling P.

  Experiment setup
  Uses the fuller synthetic micro trace: outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl

  - 316 requests, 58 sessions, all multi-turn
  - turn counts from 2 to 10, averaging 5.45 turns/session
  - initial input 30k tokens, subsequent appends of 1k tokens, outputs of 256 tokens
  - original duration 1322s; replay uses --time-scale 50, so session gaps shrink roughly 50x at runtime
  - no-stream, concurrency-limit 1000; valid for internal side-by-side comparison, not to be mixed directly with the earlier stream results

  Session gap comparison

  | trace | requests | sessions | multi-turn sessions | inter-turn gap p50 | p90 | p99 |
  |---|---:|---:|---:|---:|---:|---:|
  | full micro (original) | 316 | 58 | 58 | 100.5s | 123.0s | 129.4s |
  | full micro (replay-equivalent) | 316 | 58 | 58 | 2.01s | 2.46s | 2.59s |
  | Ali filtered small | 24 | 7 | 7 | 2.42s | 4.65s | 10.02s |
  | Ali normalized 10min | 142 | 41 | 6 | 4.26s | 8.50s | 379.4s |

  This shows that after replay the new full micro's gaps are already close to Ali filtered small, but it is more "idealized" than Ali normalized: every session is multi-turn, whereas Ali normalized has only 6/41 multi-turn sessions, and its many turn-1 seeds pollute the KV/cache budget. A sketch of the gap computation follows.
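
  A sketch of how the gap percentiles above can be computed from a trace JSONL; the "session_id" and "timestamp" field names are assumptions about the trace schema:

```python
import json
from collections import defaultdict
from statistics import quantiles

def inter_turn_gaps(path: str) -> list:
    per_session = defaultdict(list)
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            per_session[rec["session_id"]].append(rec["timestamp"])
    gaps = []
    for ts in per_session.values():
        ts.sort()
        gaps += [b - a for a, b in zip(ts, ts[1:])]  # gaps between consecutive turns
    return gaps

gaps = inter_turn_gaps("outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl")
qs = quantiles(gaps, n=100)   # qs[k-1] is the k-th percentile
print(f"p50={qs[49]:.2f}s p90={qs[89]:.2f}s p99={qs[98]:.2f}s")
```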

  PD-disaggregation multi P/D results
  All PD runs finished with 0 errors.

  | topology | mean E2E | p50 | p90 | p99 | verdict |
  |---|---:|---:|---:|---:|---|
  | 1P1D | 44.125s | 50.473s | 56.743s | 58.872s | baseline |
  | 2P1D | 42.801s | 49.016s | 54.925s | 57.446s | adding only P: tiny gain, about 3.0% |
  | 1P2D | 29.641s | 29.261s | 45.614s | 53.433s | adding only D: clear gain, about 32.8% |
  | 2P2D | 22.879s | 25.563s | 29.023s | 30.612s | best, about 48.2% |

  This set of numbers is more trustworthy than the 12-request micro. Under this workload the D-node count matters more than the P-node count, because E2E is dominated by decode concurrency, D-side queueing, and D-side session residency.

  kvcache-centric results
  The success-only means sometimes look lower, but there are many errors, so this cannot count as a valid win.

  | admission | topology | errors | successful mean | direct-to-D | main issue |
  |---|---:|---:|---:|---:|---|
  | worker | 1P1D | 0 | 45.333s | 0 | worse than PD 1P1D |
  | worker | 2P1D | 1 | 43.471s | 0 | adding only P is still limited |
  | worker | 1P2D | 33 | 25.109s | 0 | success-only mean is biased; too many errors |
  | worker | 2P2D | 32 | 19.733s | 1 | unstable, cannot count as a win |
  | router | 1P2D | 58 | 23.536s | 0 | even more kvcache-path errors |
  | router | 2P2D | 38 | 20.185s | 0 | still unstable |

  Key observation: the multi-D kvcache-centric success means are low because failed requests are excluded; at the same time direct-to-D = 0/1, which means we never stably exercise the mechanism we actually want to validate ("KV already on D, later turns hit D directly"). Most requests land in pd-router-fallback-large-append-session-cap, i.e. they are bounced back to the PD router by session-cap/admission; a route-label tally like the sketch below makes this easy to check.
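
  A small sketch for tallying route labels from a run's request-metrics.jsonl; the "route" field name and the example path are assumptions about the metrics schema:

```python
import json
from collections import Counter

def route_histogram(path: str) -> Counter:
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts[json.loads(line).get("route", "unknown")] += 1
    return counts

# Illustrative path; point this at any run's request-metrics.jsonl.
for route, n in route_histogram("outputs/live-kvcache-centric-2p2d/request-metrics.jsonl").most_common():
    print(f"{n:5d}  {route}")
```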

  Is more P better?
  The current data does not support the conclusion that P-nodes matter more.

  - Under PD, 1P1D to 2P1D improves only 44.125s -> 42.801s.
  - Under PD, 1P1D to 1P2D improves 44.125s -> 29.641s.
  - kvcache-centric's target path is direct-to-D (bypassing P on a hit), so P-node cache hits are not the main source of gains.
  - In the current implementation P may keep the prefill backup KV, so the earlier kvcache-centric gains mix D reuse, reduced KV transfer, P backup duplication, and fallback/reseed; they cannot be read as pure P cache hits.

  For xP + yD:

  - With yD fixed, adding only P helps only when prefill is clearly congested; that is not the main bottleneck this time.
  - With xP fixed, adding D directly relieves decode concurrency, D session residency, and the queueing behind sticky-D hits; this is the main gain.
  - With more D, kvcache-centric is in theory more valuable, but only if admission is correct, session-cap does not over-reject, and the direct path is stable. None of that holds yet.

  Optimization conclusions
  I have added ablation switches for seed admission, touching src/agentic_pd_hybrid/replay.py, src/agentic_pd_hybrid/benchmark.py, and src/agentic_pd_hybrid/cli.py, and filled in src/agentic_pd_hybrid/profile.py. uv run python -m compileall src passes.

  Feasible optimizations:

  1. multi-turn-aware seed admission
     Ali normalized oracle ablation: PD mean 19.643s, old kvcache 21.390s, oracle multi-turn seed 16.417s. The direction works, but production cannot use an oracle; we need to predict from trace/session features which sessions will be multi-turn (a predictor sketch follows this list).
  2. Redo D session-cap/admission
     In the current full micro, kvcache direct almost never triggers, mainly because of session-cap/admission. "Whether to seed/reseed" and "whether to go direct-to-D" need separate admission decisions, plus eviction driven by resident tokens, last access, and expected future turns.
  3. Make the P backup policy explicit
     Needs separate experiments:
      - P releases KV immediately after a successful seed
      - keep a P backup only for hot sessions
      - use the P backup only for fallback/reseed
     Otherwise we cannot answer how much P cache hits actually contribute.
  4. Multi-D kvcache path stability
     Both worker and router admission hit 500/connection-type errors under accelerated multi-D. This must be fixed first; otherwise the success-only means carry no weight.
  5. Topology policy
     Under the current 8-GPU constraint, prioritize the D count. For the agentic coding workload, the next configurations worth testing are 1P3D, 2P3D, and 2P4D, to see when D scaling saturates; P is kept only as the minimum resource that keeps prefill/transfer unclogged.
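
  A sketch of the non-oracle predictor from item 1; the features and thresholds are illustrative assumptions to be tuned per trace, not fitted values:

```python
def likely_multi_turn(first_input_tokens: int,
                      has_tool_role: bool,
                      client_tag: str = "") -> bool:
    # Agentic coding sessions tend to open with a big context and tools
    # enabled; one-shot chat seeds usually look smaller.
    if has_tool_role:
        return True
    if first_input_tokens >= 8192:            # assumed cutoff
        return True
    return client_tag in {"agent", "coding"}  # assumed client metadata

# Seed (pin KV on D) only for sessions likely to come back:
# admit_seed(s) if likely_multi_turn(...) else route_plain_pd(s)
```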

  Final judgment: kvcache-centric's gains should come from "less KV transfer + D-side session reuse + avoided re-prefill", not from P-node cache hits. The current implementation does not yet deliver this mechanism stably; next, fix admission/session-cap/direct path first, then do the P backup ablation.


• I reran the 10-minute normalized benchmark on the sampled Ali shard at outputs/ali/normalized/ali-10min-sampled-10k-append1k-out1k.jsonl. The two completed summaries are
  outputs/ali/live-kvcache-centric-ali-10min-norm-10k1k1k/kvcache-centric-sticky-20260424T042535Z/request-metrics.jsonl.summary.json and outputs/ali/live-pd-disaggregation-ali-
  10min-norm-10k1k1k/pd-disaggregation-sticky-20260424T044951Z/request-metrics.jsonl.summary.json.

  On this workload, pd-disaggregation is better than the current router-managed kvcache-centric:

  - Mean E2E: 19.64s vs 21.39s for kvcache-centric
  - Mean TTFT: 1.15s vs 2.69s
  - Mean TPOT: 0.01849s vs 0.01870s
  - P90 E2E: 34.80s vs 35.30s
  - P99 E2E is basically tied: 35.56s vs 35.53s

  The important part is why. The direct-to-D path itself is good, but it is not used enough:

  - kvcache-direct-to-d-session happened on only 43 requests
  - Those 43 requests were actually better than pd-disaggregation on the same request IDs:
      - mean E2E delta: -0.374s
      - mean TTFT delta: -0.477s
  - But 55 requests took pd-router-fallback-large-append, and those were much worse than pd-disaggregation:
      - mean E2E delta: +4.445s
      - mean TTFT delta: +4.212s

  So the current loss is not “direct-to-D is bad”. It is “too few requests get the good path, and too many fall back expensively”.
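
  A sketch of the matched-pair deltas above: join the two runs' request-metrics.jsonl on request ID and average per-request differences for one route label. The field names ("request_id", "route", "e2e_s", "ttft_s") are assumptions about the metrics schema:

```python
import json

def load_metrics(path: str) -> dict:
    with open(path) as f:
        return {r["request_id"]: r for r in map(json.loads, f)}

def mean_deltas(kv_path: str, pd_path: str, route: str):
    kv, pd = load_metrics(kv_path), load_metrics(pd_path)
    ids = [i for i, r in kv.items() if r.get("route") == route and i in pd]
    d_e2e = sum(kv[i]["e2e_s"] - pd[i]["e2e_s"] for i in ids) / len(ids)
    d_ttft = sum(kv[i]["ttft_s"] - pd[i]["ttft_s"] for i in ids) / len(ids)
    return len(ids), d_e2e, d_ttft

n, de, dt = mean_deltas("kv-run/request-metrics.jsonl",   # illustrative paths
                        "pd-run/request-metrics.jsonl",
                        route="kvcache-direct-to-d-session")
print(f"{n} matched requests: E2E delta {de:+.3f}s, TTFT delta {dt:+.3f}s")
```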

  A few concrete observations:

  - This shard has 142 requests, 41 sessions, and only 6 multi-turn sessions.
  - Only 2 of those 6 multi-turn sessions actually stayed on the direct-to-D path.
  - pd-disaggregation still used prefix cache on the P-node heavily:
      - cache_hit_request_count=139
      - mean cached tokens 24924.5
  - kvcache-centric reduced actual KV transfer from 25513 blocks to 21930 blocks, saving 3583 blocks, but that was not enough to offset the worse TTFT.

  The most actionable diagnosis is that the router's direct-to-D decision is still too conservative, or too misaligned with actual cache state. In the bad sessions, the run labeled many requests as fallback-large-append even when the P-side observed cache hit covered almost the whole prompt. That means the router-side estimate of "the uncached suffix is small enough" does not match what the workers can really reuse.

  What makes kvcache-centric better:

  - The decode node already has the session KV resident.
  - The uncached suffix is truly small.
  - The request can bypass P entirely.
  - Decode-side queueing stays controlled.

  What makes it worse:

  - The direct-hit rate is low.
  - Requests fall back after router-side misclassification.
  - Decode node takes extra append-prefill pressure without enough saved transfer/prefill cost.
  - pd-disaggregation already gets strong P-side prefix reuse, so the gap to beat is smaller than “full re-prefill”.

  The next optimizations I would do are:

  - Make direct-to-D eligibility depend on the worker-observed reusable prefix, not only the router's session-length heuristic (sketched below).
  - Preserve decode residency/stickiness better once a session is seeded on D.
  - Avoid classifying requests as “large append” when the actual uncached suffix is still small after radix reuse.
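
  A sketch of the first bullet's eligibility rule; worker_prefix_hit_tokens would come from a worker-side probe, and both thresholds are assumptions:

```python
MAX_UNCACHED_SUFFIX = 2048   # assumed budget for D-local append-prefill

def direct_to_d_eligible(prompt_tokens: int,
                         worker_prefix_hit_tokens: int,
                         d_free_kv_tokens: int) -> bool:
    # Size the decision on what the worker can actually reuse, not on a
    # router-side session-length guess.
    uncached_suffix = prompt_tokens - worker_prefix_hit_tokens
    return (uncached_suffix <= MAX_UNCACHED_SUFFIX
            and d_free_kv_tokens >= 2 * uncached_suffix)  # slack for decode output
```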

  If you want, I can move directly to patching the routing criterion using worker-observed prefix-hit information and rerun this same normalized benchmark.


• I fixed the mechanism in two places.

  • src/agentic_pd_hybrid/replay.py now uses live decode load from /v1/loads instead of only static max_total_num_tokens, and it distinguishes direct reuse from fresh seed/reseed.
  • third_party/sglang/python/sglang/srt/disaggregation/decode.py no longer crashes on multiple reused-session requests in one decode-local extend batch.
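
A sketch of the live-load gate, assuming /v1/loads (named above) returns current and maximum KV token counts; the response field names are assumptions, not a documented schema:

```python
import httpx

async def decode_kv_utilization(d_url: str) -> float:
    async with httpx.AsyncClient(timeout=2.0) as client:
        resp = await client.get(f"{d_url}/v1/loads")
        resp.raise_for_status()
        load = resp.json()
    # Assumed fields: live KV usage vs the static pool size.
    return load["num_used_tokens"] / load["max_total_num_tokens"]

# Gate example with an assumed 85% threshold:
# if await decode_kv_utilization(d_url) < 0.85: go direct, else fall back to PD
```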

Measured Results

Baseline pd-disaggregation:

  • Mean TTFT 9.32s
  • Mean TPOT 26.11 ms/token
  • Mean E2E 35.44s
  • Actual KV transfer 20004 blocks

Best measured kvcache-centric so far is auto4:

  • Summary: outputs/micro/live-kvcache-centric-smallappend-12s-16t-20gap-rdma-auto4/kvcache-centric-sticky-20260423T125640Z/request-metrics.jsonl.summary.json
  • Mean TTFT 5.44s
  • Mean TPOT 25.45 ms/token
  • Mean E2E 31.83s
  • Actual KV transfer 6670 blocks
  • Direct-to-D requests 160/180 reuse turns
  • This is about 10.2% better E2E than pd-disaggregation
  • But decode still hit "KV cache pool is full" twice, so it was faster but not at a safe operating point

The measured conservative live-load gate is auto5:

  • Summary: outputs/micro/live-kvcache-centric-smallappend-12s-16t-20gap-rdma-auto5/kvcache-centric-sticky-20260423T162904Z/request-metrics.jsonl.summary.json
  • Mean TTFT 23.36s
  • Mean TPOT 19.05 ms/token
  • Mean E2E 42.41s
  • Actual KV transfer 15589 blocks
  • Direct-to-D requests only 53/180
  • Stable, but worse than pd-disaggregation

A tuned split-gate version is now implemented in src/agentic_pd_hybrid/replay.py. Its validation run auto6 is still draining, so I am not claiming results for it yet.
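
A sketch of that split gate with hysteresis; all three thresholds are illustrative assumptions, not the tuned values in replay.py:

```python
SEED_STOP = 0.60     # stop admitting new seeds early
DIRECT_STOP = 0.90   # stop direct reuse only near true danger
RESUME = 0.50        # resume only well below the stop thresholds

class SplitGate:
    """Seeds and direct reuse are gated separately on decode KV load."""
    def __init__(self) -> None:
        self.seeding = True
        self.direct = True

    def update(self, kv_utilization: float) -> None:
        if kv_utilization >= SEED_STOP:
            self.seeding = False
        if kv_utilization >= DIRECT_STOP:
            self.direct = False
        if kv_utilization <= RESUME:
            self.seeding = self.direct = True

gate = SplitGate()
gate.update(0.70)
print(gate.seeding, gate.direct)   # False True: seeds paused, reuse still on
```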

Serious Analysis

kvcache-centric is better than pd-disaggregation when:

  • Turn2+ has a very small append and a very large reused prefix.
  • The decode node already has the session KV.
  • Decode has enough free KV blocks and is not already carrying a transfer/prealloc backlog.
  • In that regime, it avoids both P-side append-prefill and P->D KV transfer.
  • That is exactly why auto4 won: 160 turns went direct, KV transfer dropped by 66.7%, and turn2-4 TTFT stayed around 0.19s.

kvcache-centric becomes worse than pd-disaggregation when:

  • The decode node is the bottleneck.
  • Direct append-prefill consumes decode KV and decode compute at the same time.
  • Some requests still go direct while others fall back to PD.
  • That mixed regime is the worst case: direct requests occupy decode, then fallback requests pay full PD transfer cost on top of an already busy decode.
  • That is what auto5 shows. It avoided pool-full events, but it rejected too many direct hits. Late turns then became mostly PD fallback with very large cached contexts, so TTFT exploded:
    • turn9-12 mean TTFT 33.43s
    • turn13-16 mean TTFT 53.03s
  • In auto5, the bad tail is not direct requests. It is fallback traffic:
    • pd-router-fallback-d-backpressure mean TTFT 39.18s
    • pd-router-fallback-no-d-capacity mean TTFT 32.60s
  • So the real problem is not “direct-to-D is slow”. The problem is “once D gets too full, fallback happens too late, and fallback requests are then huge”.

Why kvcache-centric can beat PD in principle:

  • It removes the extra RDMA KV hop on reuse turns.
  • It avoids re-prefilling the whole reused prefix on P.
  • It reduces TTFT sharply when append is small.
  • It can reduce P-node pressure for multi-turn agentic traces.

Why it can lose in principle:

  • D has finite KV capacity, while session lengths keep growing every turn.
  • D-local append-prefill is not free; it competes with decode for the same GPU.
  • If eviction is naive, you either keep too many stale sessions or evict useful ones and reseed later.
  • If admission is too loose, you hit decode retraction.
  • If admission is too conservative, you lose the direct-hit advantage and revert to slow PD fallback.

What To Optimize Next

  • Use benefit-cost admission, not a binary threshold.
    • Direct only if saved KV transfer + saved P queueing is larger than added decode-prefill cost + added decode-pressure risk.
  • Keep separate rules for direct reuse and seed/reseed.
    • This is now coded in src/agentic_pd_hybrid/replay.py, because seed is much more expensive than direct reuse.
  • Add hysteresis.
    • Stop new seeds earlier.
    • Stop direct reuse only near true danger.
    • Resume only after decode load falls well below the stop threshold.
  • Make the append threshold dynamic.
    • Low load: allow append up to 2k.
    • Medium load: maybe 512-1k.
    • High load: 0, force PD.
  • Replace plain LRU eviction with reuse-aware eviction (see the sketch after this list).
    • Score by recent access, current prefix size, and expected future turns.
    • Evict low-future-value sessions first.
  • Prefer adding D capacity over P capacity for this workload.
    • The late-turn problem here is decode KV pressure, not P throughput.
    • More P helps turn1-heavy traces.
    • More D helps reuse-heavy multi-turn traces like this one.
  • If scaling nodes, use kvcache-centric only on sessions whose decode home is not near pressure.
    • Otherwise route them back to PD early, before the context becomes huge.
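
A sketch of the reuse-aware eviction score from the list above; the weights and the expected-turns estimate (the trace's 5.45 mean turns/session) are illustrative assumptions:

```python
import time
from dataclasses import dataclass

@dataclass
class ResidentSession:
    session_id: str
    last_access: float    # unix seconds
    prefix_tokens: int    # KV currently resident on D
    turns_so_far: int

def future_value(s: ResidentSession, now: float, mean_turns: float = 5.45) -> float:
    recency = 1.0 / (1.0 + (now - s.last_access))          # decays while idle
    expected_future_turns = max(mean_turns - s.turns_so_far, 0.0)
    # Big prefixes are expensive to reseed, so keeping them is worth more.
    return recency * (1.0 + expected_future_turns) * s.prefix_tokens

def pick_victim(sessions: list) -> ResidentSession:
    # Evict the session with the lowest expected future value, not the LRU one.
    now = time.time()
    return min(sessions, key=lambda s: future_value(s, now))
```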

Bottom Line

  • Best measured performance result right now is still auto4, but it is not safe enough.
  • Best measured safe result is auto5, but it is too conservative and slower than baseline PD.
  • The correct direction is not “always direct” and not “always back off”.
  • The correct policy is “direct reuse aggressively when D is healthy, but stop seeding much earlier than you stop direct reuse”.
  • That split policy is now implemented in src/agentic_pd_hybrid/replay.py. The validation run for it is still in progress.
| Setup | Total VRAM | Model weights | KVCache space (theoretical cap) | Per-token KV size | Tokens storable (theoretical cap) | Blocks (block size = 16) | Blocks (block size = 32) |
|---|---:|---:|---:|---|---:|---:|---:|
| GLM-5.1-FP8 @ 8×H20 141GB | 1128 GB | 756 GB | 372 GB | 89,856 B ≈ 87.75 KiB | ≈ 4.14 M | ≈ 258.7 K | ≈ 129.4 K |
| Qwen3-Coder-480B-FP8 @ 8×H20 141GB | 1128 GB | 482 GB | 646 GB | 253,952 B ≈ 248.0 KiB | ≈ 2.54 M | ≈ 159.0 K | ≈ 79.5 K |
| GLM-5.1-FP8 @ 8×B300 288GB | 2304 GB | 756 GB | 1548 GB | 89,856 B ≈ 87.75 KiB | ≈ 17.23 M | ≈ 1076.7 K | ≈ 538.4 K |
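
The table's arithmetic, reproduced as a small sketch (decimal GB, i.e. 1e9 bytes, matches the rounding above):

```python
def kv_capacity(total_gb: float, weights_gb: float,
                per_token_bytes: int, block_size: int = 16):
    kv_bytes = (total_gb - weights_gb) * 1e9      # space left for KVCache
    tokens = kv_bytes / per_token_bytes
    return tokens, tokens / block_size

for name, total, weights, per_tok in [
    ("GLM-5.1-FP8 @ 8xH20", 1128, 756, 89_856),
    ("Qwen3-Coder-480B-FP8 @ 8xH20", 1128, 482, 253_952),
    ("GLM-5.1-FP8 @ 8xB300", 2304, 756, 89_856),
]:
    tokens, blocks16 = kv_capacity(total, weights, per_tok)
    print(f"{name}: {tokens/1e6:.2f}M tokens, {blocks16/1e3:.1f}K blocks @ block size 16")
```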

Workload plots:

Basic info:

  1. CDF of input/output length, plus a CDF zoomed into the bottom 80%; under each input/output length figure annotate mean/p50/p80/p90/p95/p99.

Session-related:

  2. CDF of session turns, with a zoom into the first 10% of turns; likewise annotate mean/p50/p90/p95/p99 turns under the figure.
  3. Line chart of mean/p50/p99 request length versus turn index (group all requests with the same turn index into one class for the computation).

Tool-related:

  4. Pie chart of the share of requests triggered by each role (tool/user/…), labeled with percentages and counts.
  5. CDF of tool-call output length, likewise annotated with mean/p50/p90/p95/p99.
  6. CDF of tool-call latency (the interval from a tool-call request returning to the tool-call result coming back and triggering the next turn), with a 90% zoom-in; likewise annotate mean/p50/p90/p95/p99 under the figure.
  7. CDF of the number of consecutive tool calls after a single user input, likewise annotated with mean/p50/p90/p95/p99 under the figure.
  8. CDF of the added context when a tool call triggers the next turn in a session (this turn's input length versus the previous turn's output length), likewise with mean/p50/p90/p95/p99.

KVCache-related:

  9. CDF of kvcache block reuse time (for the same kvcache block, the interval between each reuse and the previous reuse / its first creation counts as one sample), with a 90% zoom-in; likewise annotate mean/p50/p90/p95/p99 under the figure.
  10. CDF of kvcache block lifespan (the interval from first appearance to last reuse), likewise annotated with mean/p50/p90/p95/p99.
  11. Line chart of live kvcache block count over time, with one x-axis tick per 10 minutes.

Bucketing-related:

  12. Assuming each bucket's requests are independent and can only reuse prefix kvcache within the same bucket: the kvcache reuse ratio inside each bucket, plus the overall unbucketed reuse ratio over all requests; bar chart, also labeling the total reused blocks per bucket / unbucketed.
  13. The kvcache misses caused when consecutive turns of a session cross bucket boundaries, and the lost reused kvcache blocks as a fraction of the corresponding bucket's total kvcache reuse.

An annotated-CDF helper covering most of these is sketched below.
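
A helper sketch for the annotated CDFs requested above, using matplotlib/numpy; the zoom is expressed as a quantile cutoff:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_cdf(values, title, zoom_q=None, ax=None):
    ax = ax or plt.gca()
    xs = np.sort(np.asarray(values, dtype=float))
    ys = np.arange(1, len(xs) + 1) / len(xs)
    ax.plot(xs, ys)
    if zoom_q is not None:                       # e.g. 0.8 for an 80% zoom-in
        ax.set_xlim(xs[0], np.quantile(xs, zoom_q))
    stats = {"mean": xs.mean(),
             **{f"p{q}": np.percentile(xs, q) for q in (50, 80, 90, 95, 99)}}
    ax.set_title(title)
    ax.set_xlabel("  ".join(f"{k}={v:.1f}" for k, v in stats.items()))  # stats under the plot
    ax.set_ylabel("CDF")
    return stats

# Usage: plot_cdf(input_lengths, "input length CDF", zoom_q=0.8); plt.show()
```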

service_name = 33e6d810.deployment.qwen-max-mainline-chat-* and response_params not null

```json
{
  "kv-transfer-config": {
    "kv_connector": "DecodeBenchConnector",
    "kv_role": "kv_both"
  }
}
```

Related knobs: DS_LLM_DISAGGREGATED_DECODE_NODE, enable_disaggregated_prefilling, kv_transfer_config, DS_LLM_FORCED_MAX_NEW_TOKENS (=1)

DASHGEN_DEPLOYMENT_ROLE=prefill

```bash
export DASH_GEN_ARGS=${DASH_GEN_ARGS:-'{"enable_disaggregated_prefilling": 1, "enable_think": 0, "think_mode": "auto", "gpu_memory_utilization": 0.85, "max_model_len": 262144, "enable_chunked_prefill": true, "speculative_config": {"method": "eagle3", "num_speculative_tokens": 1, "hf_overrides": {"rope_scaling": {"type": "yarn", "factor": 128, "original_max_position_embeddings": 2048, "semi_dynamic": false, "dynamic": true}, "num_experts": 0}, "model": "/home/admin/resource/model/464482ce.qwen3-235b-a22b/0717-eagle-0820"}, "swap_space": 0, "enable_prefix_caching": true, "max_num_batched_tokens": 8192, "enable_expert_parallel": false, "disable_hybrid_kv_cache_manager": true, "block_size": 64, "max_num_seqs": 64, "enforce_eager": false, "quantization": "fp8", "cuda_graph_sizes": [16, 32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800, 832, 864, 896, 928, 960, 992, 1024], "compilation_config": {"cudagraph_mode": "PIECEWISE", "use_inductor": false, "custom_ops": ["all"], "max_cudagraph_capture_size": 2048}, "kv_transfer_config": {"kv_connector": "HybridConnector", "kv_role": "kv_producer", "kv_connector_extra_config": {"backend": "kvt"}}, "hf_overrides": {"architectures": ["Qwen3MoeForCausalLM"], "model_type": "qwen3_moe"}, "kv_cache_dtype": "fp8"}'}

export DASH_GEN_ARGS=${DASH_GEN_ARGS:-'{"enable_disaggregated_prefilling": false, "enable_think": 0, "think_mode": "auto", "gpu_memory_utilization": 0.85, "max_model_len": 262144, "enable_chunked_prefill": true, "speculative_config": {"method": "eagle3", "num_speculative_tokens": 1, "hf_overrides": {"rope_scaling": {"type": "yarn", "factor": 128, "original_max_position_embeddings": 2048, "semi_dynamic": false, "dynamic": true}, "num_experts": 0}, "model": "/home/admin/resource/model/464482ce.qwen3-235b-a22b/0717-eagle-0820"}, "swap_space": 0, "enable_prefix_caching": true, "max_num_batched_tokens": 8192, "enable_expert_parallel": false, "disable_hybrid_kv_cache_manager": true, "block_size": 64, "max_num_seqs": 64, "enforce_eager": false, "quantization": "fp8", "cuda_graph_sizes": [16, 32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800, 832, 864, 896, 928, 960, 992, 1024], "compilation_config": {"cudagraph_mode": "PIECEWISE", "use_inductor": false, "custom_ops": ["all"], "max_cudagraph_capture_size": 2048}, "kv_transfer_config": "", "hf_overrides": {"architectures": ["Qwen3MoeForCausalLM"], "model_type": "qwen3_moe"}, "kv_cache_dtype": "fp8"}'}
```

Trace analysis for coding agents

  1. GLM5 040909-040911
  2. GLM5 040915-040917
  3. Qwen3-Coder 040915-040917

Current state of Ali kvcache

GLM 5 buckets: 0-32k, 32k-85k, 85k-128k, 128k+

Questions:

  • How does the scheduler affect the kvcache hit rate? Does the 5min TTL cause misjudgments?
  • The overhead of transferring kvcache through vineyard
  • The value of the vineyard remote cache

Trace analysis:

  1. CDF of input/output length
  2. CDF of session turns
  3. Bar chart of the share of trigger causes for session prev-turn -> next-turn transitions
  4. CDF of kvcache block reuse time
  5. CDF of kvcache lifespan
  6. CDF of tool-call latency
  7. Distribution of new tokens added by tool calls
  8. New tokens added by tool calls as a fraction of the current context length

Tuning runs:

  - dash0: qwen235b prefill-only thinking tuning, step slo246
  - dash1: qwen27b chat 0-8k, tuned best vs baseline
  - dash2: qwen235b decode-only thinking tuning with 0323

baseline: 0414, monitoring of the 05:00-07:00 test window


Example 1: the agent compacts history, so the engine's real context diverges from the append-only trace context, and the prefix hit shrinks to the still-shared prefix.

```
Turn 1 Input:  1 2 3          In trace: 1 2 3
Turn 1 Output: 4 5 6
Turn 2 Input:  7 8            Append-only context: 1 2 3 4 5 6 7 8
                              Real context:        1 2 4 5 6 7 8
                              Hit:                 1 2
Turn 2 Output: 9
Turn 3 Input:  10 11          Append-only context: 1 2 3 4 5 6 7 8 9 10 11
                              Real context:        1 2 4 5 6 7 9 10 11
                              Hit:                 1 2 4 5 6 7
```

Example 2: the engine's actual turn-1 output (90 91 92) differs from the trace's recorded output (4 5 6), so only the turn-1 input still matches between the trace context and the engine's real cache.

```
Turn 1 Input:  1 2 3          In trace: 1 2 3
Turn 1 Output: 90 91 92       Output in trace: 4 5 6
Turn 2 Input:  7 8            In trace: 1 2 3 4 5 6 7 8
                              Real context in engine: 1 2 3 90 91 92 7 8
                              Hit:                    1 2 3
```
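
A tiny sketch of the hit computation in example 1 (pure illustration; token IDs stand in for KV blocks):

```python
def prefix_hit(cached: list, request: list) -> list:
    # Prefix reuse = longest common prefix of the cached context and the
    # next request's real context.
    hit = []
    for a, b in zip(cached, request):
        if a != b:
            break
        hit.append(a)
    return hit

# Turn 1 leaves input 1 2 3 + output 4 5 6 in cache; turn 2 drops token 3.
print(prefix_hit([1, 2, 3, 4, 5, 6], [1, 2, 4, 5, 6, 7, 8]))
# -> [1, 2]
# Turn 2 leaves 1 2 4 5 6 7 8 + output 9; turn 3 replaces 8 with 9.
print(prefix_hit([1, 2, 4, 5, 6, 7, 8, 9], [1, 2, 4, 5, 6, 7, 9, 10, 11]))
# -> [1, 2, 4, 5, 6, 7]
```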