110bd68000c803e2a657f974af62e45860fa545c
Consolidates failure modes scattered across V2_DEEP_ANALYSIS,
E1_E2_RESULTS, E3_FINDINGS, KVC_EVICTION_GRANULARITY,
REAL_ALI_KVC_EXPERIMENT into a single lookup table with
five fields per mode: symptom → root cause → trigger →
current mitigation → real fix.
Five modes covered:
  A. Mooncake "instance not alive" cascade:
     E2 80%-failure pathology; admission no-space →
     seed burst → heartbeat drop → batch abort
  B. Cold-D / overlap-pinning:
     shared boilerplate hash pins all sessions to a
     subset of D's; load_floor_bonus is a patch, the
     real fix is exclusive_overlap redefinition
  C. Evict storm (session-level eviction):
     release_session frees 38–88K tokens in one shot;
     fix is BLOCK_LEVEL_EVICTION_DESIGN
  C'. Reseed storm (turn-1 concurrent seeds):
     startup-phase mooncake burst; fix is per-D
     pending-seed budget, frequency drops after C
  D. Streaming-session correction invariant crash (E3):
     schedule_batch.py:1646 landmine, hotfixed by
     986f351, root-fix is removing the correction
     path entirely (BLOCK_LEVEL_EVICTION §3.7)
Each mode has a forensic link back to the original
experiment doc that surfaced it.
§6 adds a diagnostic cheat sheet: "if you see X, look at Y."
§7 wires every mode to a roadmap item — Milestone 1 should
graduate §1–§4 to "mitigated" and eliminate §5.
INDEX_ZH gets a new §1.6 section linking this and the
SGLang patch inventory.
No code change. Reading dependency for anyone debugging
a sweep or writing paper §Limitations.
Agentic PD Hybrid
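This project is a minimal experimental framework built on SGLang xPyD. Its purpose is to determine
whether session-aware / KV-cache-aware P/D routing for agentic coding workloads can reduce end-to-end latency.
For a fuller but still concise description, see docs/PROJECT_OVERVIEW.md.
New collaborators: start with docs/INDEX_ZH.md and pick the 3 must-read documents that match your role (the "Who am I" section). For an overview of current progress, weak points, and the roadmap, see docs/AUDIT_AND_ROADMAP_ZH.md.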
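What it currently does
- Launches a single-node SGLang P/D stack.
- Replays the Ali coding agent trace and records request-level metrics.
- Supports the default, sticky, and kv-aware routing policies.
- Supports comparing the pd-disaggregation, kvcache-centric, and pd-colo mechanisms.
- Supports micro-benchmark traces with small appends and multi-turn sessions.
- Maintains a local patch set based on SGLang v0.5.10, kept in third_party/sglang.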
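To make the difference between the sticky and kv-aware policies concrete, here is a minimal sketch. It is not the project's implementation: the function names, the `cached_tokens` and `load` maps, and the load-based tie-break are all assumptions made for illustration.

```python
import hashlib


def sticky_route(session_id: str, workers: list[str]) -> str:
    """Hypothetical sticky policy: a session always maps to the same worker."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return workers[int.from_bytes(digest[:8], "big") % len(workers)]


def kv_aware_route(cached_tokens: dict[str, int], load: dict[str, int]) -> str:
    """Hypothetical kv-aware policy.

    Prefer the worker holding the most cached prefix tokens for this
    request; break ties in favor of the least-loaded worker.
    """
    return max(cached_tokens, key=lambda w: (cached_tokens[w], -load.get(w, 0)))
```

A sticky policy ignores cache state entirely, while the kv-aware sketch needs a per-worker estimate of reusable tokens; how that estimate is obtained is outside the scope of this sketch.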
Environment
Use uv throughout:
uv sync
Default model path:
~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct
The main test environment is currently a single node with 8 GPUs, under the constraint prefill + decode <= 8.
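As a quick illustration of the prefill + decode <= 8 constraint, the sketch below enumerates the worker splits that fit on one node. It assumes one GPU per worker and at least one worker of each kind; `valid_splits` is a hypothetical helper, not part of the project CLI.

```python
def valid_splits(total_gpus: int = 8) -> list[tuple[int, int]]:
    """Enumerate (prefill_workers, decode_workers) pairs that fit on the node.

    Assumes each worker occupies exactly one GPU and that both roles
    need at least one worker.
    """
    return [
        (p, d)
        for p in range(1, total_gpus)
        for d in range(1, total_gpus - p + 1)
    ]
```

For example, `(1, 1)` and `(4, 4)` are valid under these assumptions, while `(5, 4)` is not.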
Common commands
Generate a small-append trace:
uv run agentic-pd-hybrid make-small-append-trace \
--output outputs/smoke-hotcap-30k-1k-256.jsonl \
--session-count 4 \
--turns-per-session 3 \
--initial-input-length 30000 \
--append-input-length 1000 \
--output-length 256
Run a live benchmark:
uv run agentic-pd-hybrid benchmark-live \
--trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
--output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
--mechanism kvcache-centric \
--policy kv-aware \
--kvcache-admission-mode worker \
--prefill-workers 1 \
--decode-workers 1 \
--prefill-gpu-ids 0 \
--decode-gpu-ids 1 \
--transfer-backend mooncake \
--target-duration-s 2000 \
--session-sample-rate 1.0 \
--min-turns 2 \
--time-scale 1 \
--concurrency-limit 1000
Replay only and write metrics:
uv run agentic-pd-hybrid replay \
--trace path/to/trace.jsonl \
--policy kv-aware \
--mechanism pd-disaggregation \
--router-url http://127.0.0.1:8000 \
--output outputs/replay.jsonl
Output
Each replay/benchmark run writes:
- request metrics: request-metrics.jsonl
- summary: request-metrics.jsonl.summary.json
Key things to look at:
- E2E latency
- TTFT / TPOT
- execution mode
- cached tokens
- KV transfer blocks
- error
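For a quick look at these fields without the project tooling, a request-metrics.jsonl file can be summarized with a few lines of Python. This is a sketch only: the key names `e2e_latency_s` and `error` are assumptions, since the actual schema is not shown here, and should be replaced with whatever the file really contains.

```python
import json
import math
import statistics
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def summarize_rows(rows, latency_key="e2e_latency_s", error_key="error"):
    """Count errors and summarize latency over successful requests.

    Both key names are assumptions; adjust them to the real schema.
    """
    ok = sorted(
        float(r[latency_key])
        for r in rows
        if not r.get(error_key) and r.get(latency_key) is not None
    )
    summary = {
        "requests": len(rows),
        "errors": sum(1 for r in rows if r.get(error_key)),
    }
    if ok:
        summary["mean"] = statistics.fmean(ok)
        summary["p50"] = statistics.median(ok)
        # Nearest-rank percentile: smallest value with >= 95% of samples at or below it.
        summary["p95"] = ok[max(0, math.ceil(0.95 * len(ok)) - 1)]
    return summary
```

Typical use would be `summarize_rows(load_jsonl("outputs/<run>/request-metrics.jsonl"))`. Failed requests are excluded from the latency statistics so that a run with many errors does not look artificially fast.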
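Maintenance conventions
- Project code changes: feat: / fix: / docs:.
- SGLang changes: feat(sglang): ... / fix(sglang): ....
- The baseline of third_party/sglang is a clean SGLang v0.5.10 snapshot.
- Do not commit outputs/, logs, __pycache__, or virtual environments.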
Unit tests (no GPU)
The algorithm layer (policies, Algorithm 1 / Theorem 1) has pure-Python unit tests; running them requires neither a GPU nor SGLang:
uv sync --group test
uv run pytest
See tests/README.md for details.
Evaluation scripts
After collecting data according to docs/EVALUATION_PROTOCOL_ZH.md:
# M3: bucket by turn_id / input_length / overlap_ratio / append_tokens
scripts/analysis/stratified.py outputs/<run>/request-metrics.jsonl
# M2: paired-on-same-trial bootstrap 95% CI
scripts/analysis/paired_compare.py \
--baseline outputs/run-dp/request-metrics.jsonl \
--candidate outputs/run-kvc/request-metrics.jsonl
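The idea behind a paired bootstrap 95% CI can be sketched as follows. This is a generic percentile bootstrap over per-pair differences, not the code in scripts/analysis/paired_compare.py: the choice of the mean difference as the statistic, the resample count, and the assumption that the two lists are already aligned pair by pair are all illustrative.

```python
import random
import statistics


def paired_bootstrap_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of (candidate - baseline).

    Assumes baseline[i] and candidate[i] come from the same trial.
    Returns (mean_diff, ci_low, ci_high).
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two non-empty, equally long, aligned samples")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)  # fixed seed for reproducible intervals
    n = len(diffs)
    # Resample whole pairs (via their differences) with replacement.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return statistics.fmean(diffs), low, high
```

Resampling the differences rather than each run independently keeps the pairing intact, so variance shared by both runs on the same trial cancels out. If the resulting interval excludes zero, the two runs differ at roughly the 5% level under these assumptions.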