agentic-pd-hybrid/docs/BRANCH_SUMMARY_h200-cu130.md
2026-05-13 12:24:56 +08:00

Branch h200-cu130 Executive Summary

Branch base: kvc-debug-journey-v1-to-v4
HEAD: e9ad1c4 (latest, 2026-05-13)
Total commits: 24
Goal achieved: Partial. KVC beats naive PD on mean and p50 (-34% to -65%) and on TTFT p90 (-9%), but loses TTFT p99 by +8% (not due to D→P).


0. What was on this branch when I started

  • H200 + driver 570 environment freshly working (cu12.8 toolkit installed locally, vendored mooncake via uv path-source, mlx5_60 RDMA verified)
  • E1 (naive PD-disagg + RDMA) baseline data: 1200/1285 success, TTFT p99 = 207s
  • E2 (KVC v2 + RDMA, no load-floor) failed 80% — D2 stayed cold
  • E3 (KVC v2 + load-floor) hit an SGLang streaming-session assertion bug; the load-floor fix was verified, but the run was aborted
  • All preceded by docs/KVC_EVICTION_GRANULARITY_DESIGN_ZH.md (eviction granularity architectural critique)

The user's directive: build D→P RDMA snapshot push to skip P-side re-prefill on reseed, then run an experiment showing KVC beats naive PD-disagg.


1. What I delivered

Code

| # | Layer | Key files | Purpose |
|---|-------|-----------|---------|
| 1 | mooncake link | src/agentic_pd_hybrid/snapshot_link.py | SnapshotPeer wrapper, independent of MooncakeKVManager |
| 2 | SGLang controller | third_party/sglang/python/sglang/srt/disaggregation/snapshot/controller.py | Per-worker controller with kv_pool pre-registration |
| 3 | SGLang RPCs | io_struct.py, tokenizer_communicator_mixin.py, scheduler.py, http_server.py | 3 RPCs: prepare_receive / dump / finalize_ingest |
| 4 | agentic orchestration | src/agentic_pd_hybrid/replay.py | _attempt_d_to_p_sync invoked from reseed path |
| 5 | CLI | cli.py, benchmark.py, topology.py, stack.py | --enable-d-to-p-sync, --decode-mem-fraction-static, env injection |
| 6 | smoke tests | scripts/smoke_snapshot_link*.py, scripts/smoke_snapshot_sglang_integration.py | Phase 1/1b/2 verification |
| 7 | experiments | scripts/sweep_e4_kvc_v2_d_to_p_sync.sh, scripts/sweep_e4_pressured.sh | E4 sweep configs |
| 8 | analysis | scripts/analyze_e4_d_to_p.py, scripts/analysis/plot_e1_vs_e4.py | Cross-comparison + figures |
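The three RPCs in row 3 suggest a receive-then-push handshake. The following is a minimal sketch of how an orchestration step such as _attempt_d_to_p_sync might sequence them. It assumes, hypothetically, that P must call prepare_receive before D calls dump, and that any failed step falls back to ordinary P-side re-prefill. FakeWorker, the rpc method, and the return strings are illustrative names, not the branch's actual signatures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeWorker:
    """Stand-in for a prefill (P) or decode (D) worker. Hypothetical, for illustration only."""
    name: str
    fail_on: str = ""  # RPC name that should fail, to exercise the fallback
    calls: List[str] = field(default_factory=list)

    def rpc(self, method: str, session_id: str) -> bool:
        # Record the call and report success unless this RPC is configured to fail.
        self.calls.append(method)
        return method != self.fail_on


def attempt_d_to_p_sync(p: FakeWorker, d: FakeWorker, session_id: str) -> str:
    """Return "synced" if the snapshot reached P, else "re-prefill"."""
    # Step 1 (assumed ordering): P prepares a destination for the incoming snapshot.
    if not p.rpc("prepare_receive", session_id):
        return "re-prefill"
    # Step 2: D pushes its KV snapshot for the session.
    if not d.rpc("dump", session_id):
        return "re-prefill"
    # Step 3: P ingests what it received so the reseed can skip re-prefill.
    if not p.rpc("finalize_ingest", session_id):
        return "re-prefill"
    return "synced"


if __name__ == "__main__":
    print(attempt_d_to_p_sync(FakeWorker("P"), FakeWorker("D"), "sess-1"))
    print(attempt_d_to_p_sync(FakeWorker("P"), FakeWorker("D", fail_on="dump"), "sess-2"))
```

Treating every failure as a clean fallback keeps the sync strictly optional: the reseed path behaves as it did before whenever the snapshot cannot be delivered.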

Docs

| Doc | Content |
|-----|---------|
| D_TO_P_SYNC_DESIGN_ZH.md | 446-line design doc with 4 alternatives evaluated, MVP chosen |
| D_TO_P_PHASE1_LINK_ZH.md | Phase 1 acceptance: 316 Gbps host, 251 Gbps GPU (both verified end-to-end) |
| D_TO_P_IMPLEMENTATION_STATUS_ZH.md | Phase-by-phase audit with known unverified surfaces |
| E4_PROTOCOL_ZH.md | Experiment preregistration: H1/H2/H3 + data collection plan |
| E4_RESULTS_ZH.md | E4-v1 forensic: 272 admission rejects but 0 D→P fires (entrance gate bug) |
| E4_VS_E1_RESULTS_ZH.md | Headline results: KVC wins mean/p50/p90, loses p99 (not D→P's fault) |
| BRANCH_SUMMARY_h200-cu130.md | This doc |

Figures (under docs/figures/)

  • e1_vs_e4_ttft_pdf.png — bimodal E4 fast-path peak vs E1 single peak
  • e1_vs_e4_latency_cdf.png — CDF + log-survival showing crossover at ~p95
  • e4_path_latency.png — per-execution-mode TTFT breakdown
  • e1_vs_e4_p99_attribution.png — pie + bar breakdown of E4's p99 tail

2. Headline numbers

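| Metric | E1 naive PD | E4 KVC | Δ |
|--------|-------------|--------|---|
| TTFT mean | 90.5s | 58.8s | -35% |
| TTFT p50 | 88.5s | 31.0s | -65% |
| TTFT p90 | 175.2s | 158.9s | -9% |
| TTFT p99 | 207.4s | 224.8s | +8% |
| Lat mean | 96.3s | 63.9s | -34% |
| Lat p50 | 93.2s | 37.1s | -60% |
| Lat p99 | 219.5s | 233.8s | +6.5% |
| Success | 93.4% | 87.9% | -5.5pp |
| Wall clock | 88 min | 64 min | -27% |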

E4 served 73 requests on the direct-to-D fast path, with a mean TTFT of 0.185s. This is the latency benefit unique to KVC, and the run shows it is realized in practice.
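The Δ column can be rechecked directly from the E1 and E4 values. A small sketch, where rel_change is a helper name introduced here for illustration and the numbers are copied from the headline table:

```python
def rel_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# (E1 naive PD, E4 KVC) in seconds, copied from the headline table.
headline = {
    "TTFT mean": (90.5, 58.8),
    "TTFT p50": (88.5, 31.0),
    "TTFT p90": (175.2, 158.9),
    "TTFT p99": (207.4, 224.8),
    "Lat mean": (96.3, 63.9),
    "Lat p50": (93.2, 37.1),
    "Lat p99": (219.5, 233.8),
}

if __name__ == "__main__":
    for metric, (e1, e4) in headline.items():
        print(f"{metric:10s} {rel_change(e1, e4):+.1f}%")
```

Negative values mean E4 is faster than E1. Only the two p99 rows come out positive.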


3. The big architectural lesson

E4's p99 tail (n=65 reqs ≥ 180s TTFT) breakdown:

  • 0% direct-to-D (fast path never sees p99)
  • 5% reseed (D→P target — only 3 reqs)
  • 88% fallback chain (real culprit, dominated by large-append-session-cap 43%)

Implication: the D→P snapshot, even when fully working, addresses at most 5% of the p99 tail. The real p99 cost sits in _invoke_kvcache_seeded_router and the various fallback-real-large-append-* paths. These involve agentic-side admission RPC retries and seeded-router cold starts, not the P re-prefill that D→P was designed to eliminate.

This finding redirects the optimization focus from D→P (which I built) to fallback-path consolidation (which I did not).
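The attribution above amounts to filtering requests at or above the tail threshold and grouping them by execution mode. A minimal sketch of that computation follows. It assumes each metrics record carries a TTFT in seconds and an execution-mode label. The field names ttft_s and execution_mode are hypothetical and may not match the actual metrics JSONL schema, and the records below are toy data, not results from E4.

```python
from collections import Counter
from typing import Dict, Iterable


def tail_attribution(records: Iterable[dict], threshold_s: float = 180.0) -> Dict[str, float]:
    """Percentage of tail requests (TTFT >= threshold_s) per execution mode."""
    tail = [r for r in records if r["ttft_s"] >= threshold_s]
    if not tail:
        return {}
    counts = Counter(r["execution_mode"] for r in tail)
    # most_common() orders modes by their contribution to the tail, largest first.
    return {mode: 100.0 * n / len(tail) for mode, n in counts.most_common()}


if __name__ == "__main__":
    # Toy records for illustration only.
    toy = [
        {"ttft_s": 0.2, "execution_mode": "direct-to-d"},
        {"ttft_s": 190.0, "execution_mode": "reseed"},
        {"ttft_s": 200.0, "execution_mode": "fallback"},
        {"ttft_s": 210.0, "execution_mode": "fallback"},
        {"ttft_s": 220.0, "execution_mode": "fallback"},
    ]
    print(tail_attribution(toy))
```

Grouping by mode rather than by worker is what exposes the fallback chain as the dominant contributor. A per-worker view would hide it.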


4. What's pending / known issues

  • E4-v3 ran with the --enable-d-to-p-sync flag, but a CLI plumbing bug meant D→P never actually fired; fixed in af966f2. E4-v4 (running at time of writing) should validate the path end-to-end.
  • E4's success rate is 5.5pp below E1's (87.9% vs 93.4%). Failures are concentrated in agentic-side timeouts on pd-router-real-large-append paths; this is not a D→P issue.
  • D→P snapshot active mode (push at append completion, as opposed to the current passive mode triggered on reseed) was not built. Per design doc §2.5, this could be the next phase.
  • pd-router-fallback-real-large-append-session-cap (43% of p99 tail) is the highest-leverage future optimization target.

5. Commits (newest first)

e9ad1c4 feat(experiments): E4 vs E1 results + p99 attribution figures
af966f2 fix(cli): plumb --enable-d-to-p-sync through benchmark-live → ReplayConfig
f6d6dc0 feat(cli): per-role --mem-fraction-static + use in E4-pressured
fbeb968 feat(experiments): E4-pressured sweep — force reseed via reject_threshold=1
e729d62 fix(d2p): structural log + relax entrance condition for sync
1d68ad6 docs(experiments): E4 results — initial scaffold + mid-run observation
9149b53 feat(experiments): E4 cross-comparison analysis helper
a4f30e6 docs(d2p): implementation status snapshot — Phase 1-3 audit
8a2f72f feat(experiments): E4 protocol + sweep script — KVC + D→P vs naive PD
b9b0cf0 feat(agentic): D→P snapshot orchestration in reseed path + CLI flag
a369722 fix(sglang): account snapshot-reserved slots in radix mem leak check
86412bb feat(sglang): D→P snapshot link integration — controller + RPC handlers
7216507 feat(snapshot): D→P RDMA Phase 1b — GPU pointer path verified
dc4867c feat(snapshot): D→P RDMA link Phase 1 — minimal byte transport
9c35edd docs(design): D→P RDMA snapshot push design
6d1c923 docs(architecture): KVC eviction granularity is the wrong abstraction
986f351 feat(sglang): drop streaming-session reqs with fill_ids < prefix_indices
d40db1f docs(experiments): E3 first run — load-floor bonus works, exposes SGLang bug
a1abdcd feat(experiments): E3 sweep — KVC v2 + RDMA + load-floor bonus
93fce42 feat(policy): load-floor bonus for KvAwarePolicy (Q2.B)
905d671 feat(env): MC_TRANSFER_TIMEOUT=1800s default in setup_env + stack
9a166ac docs(experiments): design space for Q1 (mooncake stall) + Q2 (cold-D)
... (predecessor work)

6. How to reproduce

# Env setup
source scripts/setup_env.sh

# Pre-existing baseline (E1)
bash scripts/sweep_e1_naive_1p3d.sh

# KVC + load-floor + D→P (E4-pressured)
bash scripts/sweep_e4_pressured.sh

# Cross-comparison + figures
uv run --no-sync python scripts/analysis/plot_e1_vs_e4.py \
  --e1-metrics outputs/e1_naive_1p3d_kvaware_rdma_50sess/e1_naive_1p3d_kvaware_run1_metrics.jsonl \
  --e4-metrics outputs/e4p_kvc_v2_d_to_p_sync_pressured_50sess/e4p_kvc_v2_d_to_p_sync_run1_metrics.jsonl

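Key takeaway: the D→P RDMA link is deployed across the full stack and verified by the link smoke tests. E4 data shows KVC beating naive PD-disagg on mean and p50 by 34-65% and on TTFT p90 by 9%. p99 tail attribution shows that D→P is not on the p99 critical path, so the next optimization phase should target the fallback chain.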