7590e55189c876f78d2db9290f5b04bf06af8290
Two cleanups:
1. Drop "E1: naive 1P3D default" experiment from the onboarding manual.
GPU hours are precious; naive 1P3D + policy=default is near-certain to
lose on multi-turn cache hits (it is round-robin with no prefix awareness),
so the comparison adds no information beyond E1 = naive 1P3D kv-aware.
The new manifest has only 2 runs: E1 (naive 1P3D kv-aware) + E2 (KVC
v2 + RDMA). Run-time budget drops from 16.5h serial to 11h serial /
5.5h parallel. Updated:
- §0 TL;DR ("3 组" -> "2 组")
- §2 H1 hypothesis (drop "default and kv-aware each one" -> just kv-aware)
- §3.1 experiment matrix (3 rows -> 2 rows + rationale for the drop)
- §3.2 startup config (drop E1 default section, renumber E2/E3 -> E1/E2)
- §6 decision table + expected-range table
- §7 FAQ ("3 个 E1-E3" -> "2 个 E1-E2")
- §9 deliverables
2. Move 8 deprecated docs to docs/archive/:
AGENTIC_FIT_ANALYSIS_ZH.md (ts=10 era analysis; superseded)
STRUCTURAL_VALIDATION_REPORT_ZH.md (ts=10 era validation; superseded)
KVC_DEBUG_JOURNEY_V1_TO_V5.md (v1-v5 sweep process notes)
V5_PROFILE_INVESTIGATION_ZH.md (v5 1Hz polling investigation)
REFACTOR_PLAN_ZH.md (v0 plan; superseded by V1)
KVCACHE_CENTRIC_PROGRESS_ZH.md (earliest 2026-04-27 progress)
SWEBENCH_EXPERIMENT_PROGRESS.md (early SWE trace setup)
SWEBENCH_EXPERIMENT_RESULTS.md (early SWE result snapshot)
All cross-references in active docs (V2_DEEP_ANALYSIS / V2_RESULTS /
REFACTOR_PLAN_V1 / TEAM_REPORT / ONBOARDING) rewritten from
`docs/FOO.md` to `docs/archive/FOO.md` via sed pass.
Added `docs/archive/README.md` explaining what each archived doc is
and when (if ever) to reopen it. Designed so a new reader hitting
the archive dir immediately knows it's not required reading.
After this commit the active docs in docs/ are 9 files (down from 17),
which should make the onboarding doc's "Level 1 / Level 2 / Level 3"
classification self-evident.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Agentic PD Hybrid
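This project is a minimal experiment framework built on SGLang xPyD. It is meant to answer one question:
Can session-aware / KV-cache-aware P/D routing for agentic coding workloads reduce end-to-end latency?
A fuller but still concise description is in docs/PROJECT_OVERVIEW.md.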
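What it currently does
- Starts a single-node SGLang P/D stack.
- Replays the Ali coding agent trace and records request-level metrics.
- Supports the `default`, `sticky`, and `kv-aware` routing policies.
- Supports comparing the `pd-disaggregation`, `kvcache-centric`, and `pd-colo` mechanisms.
- Supports micro-benchmark traces with small appends and multi-turn sessions.
- Maintains a local patch based on SGLang `v0.5.10`, kept in `third_party/sglang`.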
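Why the routing policy matters for multi-turn sessions can be illustrated with a toy model. The sketch below is not the project's router: `round_robin_router`, `sticky_router`, and `same_worker_fraction` are hypothetical names, and "landing on the same worker as the previous turn" is used only as a rough proxy for a prefix-cache hit.

```python
from itertools import count


def round_robin_router(num_workers):
    """Routing function that ignores the session and cycles through workers."""
    counter = count()

    def route(session_id):
        return next(counter) % num_workers

    return route


def sticky_router(num_workers):
    """Routing function that pins each session to the worker it first received."""
    assignment = {}

    def route(session_id):
        if session_id not in assignment:
            assignment[session_id] = len(assignment) % num_workers
        return assignment[session_id]

    return route


def same_worker_fraction(route, requests):
    """Fraction of follow-up turns routed to the same worker as the previous turn."""
    last_worker = {}
    follow_ups = 0
    hits = 0
    for session_id in requests:
        worker = route(session_id)
        if session_id in last_worker:
            follow_ups += 1
            hits += int(worker == last_worker[session_id])
        last_worker[session_id] = worker
    return hits / follow_ups if follow_ups else 0.0


# Four sessions, three turns each, interleaved turn by turn, over 3 workers.
requests = [f"s{i}" for _ in range(3) for i in range(4)]
rr = same_worker_fraction(round_robin_router(3), requests)
sticky = same_worker_fraction(sticky_router(3), requests)
print(rr, sticky)
```

In this particular interleaving, round-robin never returns a session to its previous worker while the sticky variant always does. Real cache behavior also depends on eviction and on what a kv-aware policy actually inspects, which this sketch does not model.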
Environment
Use uv throughout:
uv sync
Default model path:
~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct
The main test environment is currently a single node with 8 GPUs, under the constraint prefill + decode <= 8.
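The GPU constraint can be checked before launching anything. This is a minimal sketch that assumes one GPU per worker; `validate_worker_split` is a hypothetical helper, not part of the project's CLI.

```python
def validate_worker_split(prefill_workers, decode_workers, total_gpus=8):
    """Return the number of idle GPUs, or raise if the split does not fit.

    Assumption: each prefill or decode worker occupies exactly one GPU.
    """
    if prefill_workers < 0 or decode_workers < 0:
        raise ValueError("worker counts must be non-negative")
    used = prefill_workers + decode_workers
    if used > total_gpus:
        raise ValueError(
            f"prefill + decode = {used} exceeds the {total_gpus} available GPUs"
        )
    return total_gpus - used


# A 1P3D layout uses 4 of the 8 GPUs and leaves 4 idle.
idle = validate_worker_split(prefill_workers=1, decode_workers=3)
print(idle)
```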
Common commands
Generate a small-append trace:
uv run agentic-pd-hybrid make-small-append-trace \
--output outputs/smoke-hotcap-30k-1k-256.jsonl \
--session-count 4 \
--turns-per-session 3 \
--initial-input-length 30000 \
--append-input-length 1000 \
--output-length 256
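To get a feel for how large such a trace is, the flag values can be turned into request and token counts. This sketch rests on an assumption about what the flags mean (turn 1 supplies `--initial-input-length` new tokens, each later turn supplies `--append-input-length`, and every turn generates `--output-length`); `small_append_plan` is a hypothetical helper and the actual trace schema is not shown here.

```python
def small_append_plan(session_count, turns_per_session,
                      initial_input_length, append_input_length,
                      output_length):
    """Estimate request and token counts for a small-append trace.

    Assumption (not confirmed by the project): only newly supplied input
    tokens are counted, not the accumulated context of earlier turns.
    """
    requests = session_count * turns_per_session
    new_input_per_session = (
        initial_input_length + (turns_per_session - 1) * append_input_length
    )
    return {
        "requests": requests,
        "new_input_tokens": session_count * new_input_per_session,
        "output_tokens": requests * output_length,
    }


# Values taken from the command above.
plan = small_append_plan(
    session_count=4,
    turns_per_session=3,
    initial_input_length=30000,
    append_input_length=1000,
    output_length=256,
)
print(plan)
```

Under these assumptions the command yields 12 requests, and most of the input tokens come from the first turn of each session, which is the case where reusing cached prefixes on later turns should matter most.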
Run a live benchmark:
uv run agentic-pd-hybrid benchmark-live \
--trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
--output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
--mechanism kvcache-centric \
--policy kv-aware \
--kvcache-admission-mode worker \
--prefill-workers 1 \
--decode-workers 1 \
--prefill-gpu-ids 0 \
--decode-gpu-ids 1 \
--transfer-backend mooncake \
--target-duration-s 2000 \
--session-sample-rate 1.0 \
--min-turns 2 \
--time-scale 1 \
--concurrency-limit 1000
Replay only and write metrics:
uv run agentic-pd-hybrid replay \
--trace path/to/trace.jsonl \
--policy kv-aware \
--mechanism pd-disaggregation \
--router-url http://127.0.0.1:8000 \
--output outputs/replay.jsonl
Output
Each replay/benchmark run writes:
- request metrics: `request-metrics.jsonl`
- summary: `request-metrics.jsonl.summary.json`
Key things to look at:
- E2E latency
- TTFT / TPOT
- execution mode
- cached tokens
- KV transfer blocks
- error
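A quick way to inspect the per-request file is to aggregate it directly. The sketch below assumes one JSON object per line; the field names `e2e_latency_s` and `error` are hypothetical placeholders and should be replaced with whatever keys the metrics file actually contains.

```python
import json
from statistics import mean, median


def summarize_latency(lines, latency_field="e2e_latency_s", error_field="error"):
    """Aggregate latency over JSONL records, counting errored requests separately.

    The default field names are assumptions, not the project's schema.
    """
    latencies = []
    total = 0
    errors = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        total += 1
        if record.get(error_field):
            errors += 1
            continue
        value = record.get(latency_field)
        if value is not None:
            latencies.append(float(value))
    return {
        "requests": total,
        "errors": errors,
        "mean": mean(latencies) if latencies else None,
        "median": median(latencies) if latencies else None,
    }


# In-memory sample standing in for the lines of a metrics file.
sample = [
    '{"e2e_latency_s": 1.0}',
    '{"e2e_latency_s": 3.0}',
    '{"error": "timeout"}',
    "",
]
summary = summarize_latency(sample)
print(summary)
```

Because the function accepts any iterable of lines, an open file handle for `request-metrics.jsonl` can be passed in place of `sample`.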
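Maintenance conventions
- Project code changes: `feat:` / `fix:` / `docs:`.
- SGLang changes: `feat(sglang): ...` / `fix(sglang): ...`.
- The baseline for `third_party/sglang` is a clean SGLang `v0.5.10` snapshot.
- Do not commit `outputs/`, logs, `__pycache__`, or virtual environments.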