kzlin 74194e660a docs: v4 final results, error analysis, and updated journey
Add v4 sweep results and post-mortem analysis showing:

- direct-to-D path: 54.3% (1P7D) / 58.0% (2P6D) of requests now use
  KVC cleanly. P50=0.5s and TTFT P50=0.043s; this path beats baseline
  8DP across the board (P50 -24%, TTFT P50 -54%, TTFT P90 -79%).

- Overall vs baseline (errors+truncated excluded):
  v4 2P6D P50=0.85s vs baseline 0.66s (28% slower).
  Reason is not errors -- 35% of requests still hit
  fallback-large-append-session-cap, where capacity-based
  cap = usable_tokens / target_tokens evaluates to 1-2 (not 16)
  for large agentic inputs.

- 9-10% errors on KVC variants are mooncake TCP transfer timeouts,
  not SGLang logic bugs. Prefill log shows
  "Failed to send kv chunk ... 32s timeout ... session not alive".
  Errors concentrate in turn>=31 (large inputs) after run >44.8%.

Track:
- docs/KVC_DEBUG_JOURNEY_V1_TO_V4.md: append v4 results table,
  per-mode breakdown, and error root cause.
- scripts/analysis/{analyze_v3,analyze_v4,analyze_errors,compare_no_error}.py
- outputs/qwen3-30b-tp1-v{3,4}*/exp*_summary.json (force-added,
  small JSON; metrics.jsonl excluded due to size).
- outputs/qwen3-30b-tp1-v{3,4}*/sweep_results.txt

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:34:01 +08:00
2026-04-24 12:17:40 +00:00

Agentic PD Hybrid

这个项目是在 SGLang xPyD 上做一个最小实验框架,用来判断:

面向 agentic coding workload 的 session-aware / KV-cache-aware P/D routing能不能降低端到端延迟。

更完整但仍然简洁的说明见 docs/PROJECT_OVERVIEW.md

当前做了什么

  • 启动单机 SGLang P/D 栈。
  • 回放 Ali coding agent trace并记录 request-level metrics。
  • 支持 defaultstickykv-aware 路由策略。
  • 支持 pd-disaggregationkvcache-centricpd-colo 对比。
  • 支持小 append、多轮 session 的 micro-benchmark trace。
  • 维护了基于 SGLang v0.5.10 的本地 patch放在 third_party/sglang

环境

统一使用 uv

uv sync

默认模型路径:

~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct

当前主要测试环境是单机 8 GPU约束是 prefill + decode <= 8

常用命令

生成小 append trace

uv run agentic-pd-hybrid make-small-append-trace \
  --output outputs/smoke-hotcap-30k-1k-256.jsonl \
  --session-count 4 \
  --turns-per-session 3 \
  --initial-input-length 30000 \
  --append-input-length 1000 \
  --output-length 256

跑 live benchmark

uv run agentic-pd-hybrid benchmark-live \
  --trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
  --output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
  --mechanism kvcache-centric \
  --policy kv-aware \
  --kvcache-admission-mode worker \
  --prefill-workers 1 \
  --decode-workers 1 \
  --prefill-gpu-ids 0 \
  --decode-gpu-ids 1 \
  --transfer-backend mooncake \
  --target-duration-s 2000 \
  --session-sample-rate 1.0 \
  --min-turns 2 \
  --time-scale 1 \
  --concurrency-limit 1000

只回放并写 metrics

uv run agentic-pd-hybrid replay \
  --trace path/to/trace.jsonl \
  --policy kv-aware \
  --mechanism pd-disaggregation \
  --router-url http://127.0.0.1:8000 \
  --output outputs/replay.jsonl

输出

每次 replay/benchmark 会写:

  • request metricsrequest-metrics.jsonl
  • 汇总结果:request-metrics.jsonl.summary.json

重点看:

  • E2E latency
  • TTFT / TPOT
  • execution mode
  • cached tokens
  • KV transfer blocks
  • error

维护约定

  • 项目代码改动:feat: / fix: / docs:
  • SGLang 改动:feat(sglang): ... / fix(sglang): ...
  • third_party/sglang 的基线是 clean SGLang v0.5.10 snapshot。
  • 不提交 outputs/、日志、__pycache__、虚拟环境。
Description
No description provided
Readme 18 MiB
Languages
Python 100%