Branch h200-cu130 Executive Summary
Branch base: kvc-debug-journey-v1-to-v4
HEAD: e9ad1c4 (latest, 2026-05-13)
Total commits: 24
Goal achieved: Partial. KVC beats naive PD on TTFT mean/p50/p90 (-35%, -65%, -9%) but loses p99 by +8% (not due to D→P).
0. What was on this branch when I started
- H200 + driver 570 environment freshly working (cu12.8 toolkit installed locally, vendored mooncake via uv path-source, mlx5_60 RDMA verified)
- E1 (naive PD-disagg + RDMA) baseline data: 1200/1285 success, TTFT p99 = 207s
- E2 (KVC v2 + RDMA, no load-floor) failed 80% — D2 stayed cold
- E3 (KVC v2 + load-floor) had SGLang streaming-session assertion bug; load-floor fix verified, run aborted
- All preceded by `docs/KVC_EVICTION_GRANULARITY_DESIGN_ZH.md` (eviction granularity architectural critique)
The user's directive: build D→P RDMA snapshot push to skip P-side re-prefill on reseed, then run an experiment showing KVC beats naive PD-disagg.
1. What I delivered
Code
| # | Layer | Key files | Purpose |
|---|---|---|---|
| 1 | mooncake link | `src/agentic_pd_hybrid/snapshot_link.py` | `SnapshotPeer` wrapper, independent of `MooncakeKVManager` |
| 2 | SGLang controller | `third_party/sglang/python/sglang/srt/disaggregation/snapshot/controller.py` | Per-worker controller with kv_pool pre-registration |
| 3 | SGLang RPCs | `io_struct.py`, `tokenizer_communicator_mixin.py`, `scheduler.py`, `http_server.py` | 3 RPCs: `prepare_receive` / `dump` / `finalize_ingest` |
| 4 | agentic orchestration | `src/agentic_pd_hybrid/replay.py` | `_attempt_d_to_p_sync` invoked from reseed path |
| 5 | CLI | `cli.py`, `benchmark.py`, `topology.py`, `stack.py` | `--enable-d-to-p-sync`, `--decode-mem-fraction-static`, env injection |
| 6 | smoke tests | `scripts/smoke_snapshot_link*.py`, `scripts/smoke_snapshot_sglang_integration.py` | Phase 1/1b/2 verification |
| 7 | experiments | `scripts/sweep_e4_kvc_v2_d_to_p_sync.sh`, `scripts/sweep_e4_pressured.sh` | E4 sweep configs |
| 8 | analysis | `scripts/analyze_e4_d_to_p.py`, `scripts/analysis/plot_e1_vs_e4.py` | Cross-comparison + figures |
Docs
| Doc | Content |
|---|---|
| `D_TO_P_SYNC_DESIGN_ZH.md` | 446-line design doc with 4 alternatives evaluated, MVP chosen |
| `D_TO_P_PHASE1_LINK_ZH.md` | Phase 1 acceptance: 316 Gbps host, 251 Gbps GPU (both verified end-to-end) |
| `D_TO_P_IMPLEMENTATION_STATUS_ZH.md` | Phase-by-phase audit with known unverified surfaces |
| `E4_PROTOCOL_ZH.md` | Experiment preregistration: H1/H2/H3 + data collection plan |
| `E4_RESULTS_ZH.md` | E4-v1 forensic: 272 admission rejects but 0 D→P fires (entrance gate bug) |
| `E4_VS_E1_RESULTS_ZH.md` | Headline results: KVC wins mean/p50/p90, loses p99 (not D→P's fault) |
| `BRANCH_SUMMARY_h200-cu130.md` | This doc |
Figures (under docs/figures/)
- `e1_vs_e4_ttft_pdf.png`: bimodal E4 fast-path peak vs E1 single peak
- `e1_vs_e4_latency_cdf.png`: CDF + log-survival showing crossover at ~p95
- `e4_path_latency.png`: per-execution-mode TTFT breakdown
- `e1_vs_e4_p99_attribution.png`: pie + bar breakdown of E4's p99 tail
2. Headline numbers
| Metric | E1 naive PD | E4 KVC | Δ |
|---|---|---|---|
| TTFT mean | 90.5s | 58.8s | -35% |
| TTFT p50 | 88.5s | 31.0s | -65% |
| TTFT p90 | 175.2s | 158.9s | -9% |
| TTFT p99 | 207.4s | 224.8s | +8% |
| Lat mean | 96.3s | 63.9s | -34% |
| Lat p50 | 93.2s | 37.1s | -60% |
| Lat p99 | 219.5s | 233.8s | +6.5% |
| Success | 93.4% | 87.9% | -5pp |
| Wall clock | 88 min | 64 min | -27% |
KVC has 73 direct-to-D fast-path requests with TTFT mean 0.185s — the unique KVC value prop is realized.
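The Δ column in the table above can be re-derived from the two measured columns. The following is a minimal sketch of that arithmetic, not the branch's analysis script; the `HEADLINE` dictionary and the function name `relative_delta_pct` are introduced here purely for illustration, and the values are copied from the table.

```python
# Values copied from the headline table: (E1 naive PD, E4 KVC).
# The dict name and layout are illustrative, not part of the repository.
HEADLINE = {
    "TTFT mean": (90.5, 58.8),
    "TTFT p50": (88.5, 31.0),
    "TTFT p90": (175.2, 158.9),
    "TTFT p99": (207.4, 224.8),
    "Lat mean": (96.3, 63.9),
    "Lat p50": (93.2, 37.1),
    "Lat p99": (219.5, 233.8),
    "Wall clock (min)": (88.0, 64.0),
}


def relative_delta_pct(baseline: float, candidate: float) -> float:
    """Signed percentage change of candidate relative to baseline."""
    return (candidate - baseline) / baseline * 100.0


if __name__ == "__main__":
    for name, (e1, e4) in HEADLINE.items():
        print(f"{name:<18} {relative_delta_pct(e1, e4):+6.1f}%")
```

Negative values mean E4 is faster than E1. Only the p99 rows come out positive, which matches the crossover near p95 noted for the latency CDF figure.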
3. The big architectural lesson
E4's p99 tail (n=65 reqs ≥ 180s TTFT) breakdown:
- 0% direct-to-D (fast path never sees p99)
- 5% reseed (D→P target — only 3 reqs)
- 88% fallback chain (real culprit, dominated by `large-append-session-cap` at 43% of the tail)
Implication: D→P snapshot, even when fully working, addresses at most 5% of p99 tail. The real p99 cost is in _invoke_kvcache_seeded_router and various fallback-real-large-append-* paths, which involve agentic-side admission RPC retries + seeded-router cold starts, not the P re-prefill that D→P was designed to eliminate.
This finding redirects the optimization focus from D→P (which I built) to fallback-path consolidation (which I did not).
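The attribution above amounts to filtering requests at the tail threshold and grouping them by execution path. Here is a minimal sketch of that computation, assuming each metrics record carries a TTFT value and a path label; the field names `ttft_s` and `execution_mode` and the mode labels in the toy data are hypothetical and may not match the actual metrics schema.

```python
from collections import Counter


def tail_attribution(records, threshold_s=180.0,
                     ttft_key="ttft_s", mode_key="execution_mode"):
    """Return {mode: share of tail requests} for records with TTFT >= threshold_s.

    The key names are assumptions for this sketch, not the real schema.
    """
    tail = [r for r in records if r[ttft_key] >= threshold_s]
    if not tail:
        return {}
    counts = Counter(r[mode_key] for r in tail)
    return {mode: n / len(tail) for mode, n in counts.most_common()}


# Synthetic records for illustration only; these are not E4 measurements.
toy = [
    {"ttft_s": 0.2, "execution_mode": "direct-to-d"},
    {"ttft_s": 190.0, "execution_mode": "reseed"},
    {"ttft_s": 200.0, "execution_mode": "fallback"},
    {"ttft_s": 210.0, "execution_mode": "fallback"},
    {"ttft_s": 220.0, "execution_mode": "fallback"},
]
shares = tail_attribution(toy)

# Figures taken from the text: 3 reseed requests in a 65-request tail.
reseed_share = 3 / 65
```

With the reported counts, `reseed_share` is about 4.6%, which is where the "at most 5%" bound on what D→P can address in the p99 tail comes from.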
4. What's pending / known issues
- E4-v3 ran with the `--enable-d-to-p-sync` flag, but a CLI plumbing bug meant D→P didn't actually fire. Fix in `af966f2`. E4-v4 should validate end-to-end (running at time of writing).
- E4 success rate is -5pp vs E1 (87.9% vs 93.4%). Failures are concentrated in agentic-side timeouts on `pd-router-real-large-append` paths. Not a D→P issue.
- D→P snapshot active mode (push at append-completion, vs the current passive mode triggered on reseed) was not built. Per design doc §2.5, this could be the next phase.
- `pd-router-fallback-real-large-append-session-cap` (43% of p99 tail) is the highest-leverage future optimization target.
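The first item above is a flag that was accepted on the command line but never reached the replay configuration. One way to guard against that class of bug is to build the config from its declared fields, so a field that the parser does not provide fails loudly instead of silently taking its default. The sketch below is not the repository's code: the flag names come from this summary and `ReplayConfig` is named in commit `af966f2`, but its fields, `build_parser`, and `config_from_args` are hypothetical.

```python
import argparse
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class ReplayConfig:
    """Hypothetical stand-in; the real ReplayConfig has a different shape."""
    enable_d_to_p_sync: bool = False
    decode_mem_fraction_static: Optional[float] = None


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="benchmark-live")
    parser.add_argument("--enable-d-to-p-sync", action="store_true")
    parser.add_argument("--decode-mem-fraction-static", type=float, default=None)
    return parser


def config_from_args(args: argparse.Namespace) -> ReplayConfig:
    # Iterate over the dataclass fields rather than hand-copying attributes:
    # a config field with no matching CLI argument raises AttributeError here
    # instead of quietly keeping its default value.
    return ReplayConfig(**{f.name: getattr(args, f.name) for f in fields(ReplayConfig)})


cfg_on = config_from_args(build_parser().parse_args(["--enable-d-to-p-sync"]))
cfg_off = config_from_args(build_parser().parse_args([]))
```

A smoke check that asserts `cfg_on.enable_d_to_p_sync` is true before launching a sweep would have flagged a run like E4-v3 before it consumed GPU time.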
5. Commits (newest first)
```
e9ad1c4 feat(experiments): E4 vs E1 results + p99 attribution figures
af966f2 fix(cli): plumb --enable-d-to-p-sync through benchmark-live → ReplayConfig
f6d6dc0 feat(cli): per-role --mem-fraction-static + use in E4-pressured
fbeb968 feat(experiments): E4-pressured sweep — force reseed via reject_threshold=1
e729d62 fix(d2p): structural log + relax entrance condition for sync
1d68ad6 docs(experiments): E4 results — initial scaffold + mid-run observation
9149b53 feat(experiments): E4 cross-comparison analysis helper
a4f30e6 docs(d2p): implementation status snapshot — Phase 1-3 audit
8a2f72f feat(experiments): E4 protocol + sweep script — KVC + D→P vs naive PD
b9b0cf0 feat(agentic): D→P snapshot orchestration in reseed path + CLI flag
a369722 fix(sglang): account snapshot-reserved slots in radix mem leak check
86412bb feat(sglang): D→P snapshot link integration — controller + RPC handlers
7216507 feat(snapshot): D→P RDMA Phase 1b — GPU pointer path verified
dc4867c feat(snapshot): D→P RDMA link Phase 1 — minimal byte transport
9c35edd docs(design): D→P RDMA snapshot push design
6d1c923 docs(architecture): KVC eviction granularity is the wrong abstraction
986f351 feat(sglang): drop streaming-session reqs with fill_ids < prefix_indices
d40db1f docs(experiments): E3 first run — load-floor bonus works, exposes SGLang bug
a1abdcd feat(experiments): E3 sweep — KVC v2 + RDMA + load-floor bonus
93fce42 feat(policy): load-floor bonus for KvAwarePolicy (Q2.B)
905d671 feat(env): MC_TRANSFER_TIMEOUT=1800s default in setup_env + stack
9a166ac docs(experiments): design space for Q1 (mooncake stall) + Q2 (cold-D)
... (predecessor work)
```
6. How to reproduce
```bash
# Env setup
source scripts/setup_env.sh

# Pre-existing baseline (E1)
bash scripts/sweep_e1_naive_1p3d.sh

# KVC + load-floor + D→P (E4-pressured)
bash scripts/sweep_e4_pressured.sh

# Cross-comparison + figures
uv run --no-sync python scripts/analysis/plot_e1_vs_e4.py \
  --e1-metrics outputs/e1_naive_1p3d_kvaware_rdma_50sess/e1_naive_1p3d_kvaware_run1_metrics.jsonl \
  --e4-metrics outputs/e4p_kvc_v2_d_to_p_sync_pressured_50sess/e4p_kvc_v2_d_to_p_sync_run1_metrics.jsonl
```
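For a quick sanity check of a metrics file without running the plotting script, the summary statistics can be recomputed directly from the JSONL lines. This is a rough sketch only: the per-record key `ttft_s` is a hypothetical name, and the nearest-rank percentile used here is an assumption, since this summary does not state which percentile method the analysis scripts use.

```python
import json
import math
import statistics


def nearest_rank_percentile(sorted_values, q):
    """Nearest-rank percentile for an integer q in 1..100 on a pre-sorted list."""
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize_ttft(jsonl_lines, key="ttft_s"):
    """Compute mean/p50/p90/p99 from an iterable of JSONL strings.

    `key` is an assumed field name; adjust it to the real metrics schema.
    """
    values = sorted(json.loads(line)[key] for line in jsonl_lines if line.strip())
    if not values:
        raise ValueError("no records found")
    return {
        "n": len(values),
        "mean": statistics.fmean(values),
        "p50": nearest_rank_percentile(values, 50),
        "p90": nearest_rank_percentile(values, 90),
        "p99": nearest_rank_percentile(values, 99),
    }


# Synthetic input: 100 records with TTFT 1.0 .. 100.0 seconds (not real data).
demo_lines = [json.dumps({"ttft_s": float(i)}) for i in range(1, 101)]
demo_summary = summarize_ttft(demo_lines)
```

To apply it to a real run, pass an open file handle for one of the `*_metrics.jsonl` files in place of `demo_lines`.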
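Key takeaway: the D→P RDMA link is deployed across the full stack and verified by the link smoke tests; the E4 data shows KVC beating naive PD-disagg on mean/p50/p90 by 9-65%; p99 tail attribution shows that D→P is not on the p99 critical path, so the next optimization phase should target the fallback chain.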