docs: branch executive summary h200-cu130
148
docs/BRANCH_SUMMARY_h200-cu130.md
Normal file
@@ -0,0 +1,148 @@
# Branch `h200-cu130` Executive Summary

**Branch base**: `kvc-debug-journey-v1-to-v4`
**HEAD**: `e9ad1c4` (latest, 2026-05-13)
**Total commits**: 24
**Goal achieved**: Partial. KVC beats naive PD on TTFT mean/p50/p90 (-35%, -65%, and -9% respectively) but loses on p99 by +8% (not due to D→P).

---

## 0. What was on this branch when I started

- H200 + driver 570 environment freshly working (cu12.8 toolkit installed locally, vendored mooncake via uv path-source, mlx5_60 RDMA verified)
- E1 (naive PD-disagg + RDMA) baseline data: 1200/1285 success, TTFT p99 = 207s
- E2 (KVC v2 + RDMA, no load-floor) failed 80% (D2 stayed cold)
- E3 (KVC v2 + load-floor) hit an SGLang streaming-session assertion bug; the load-floor fix was verified, but the run was aborted
- All of the above was preceded by `docs/KVC_EVICTION_GRANULARITY_DESIGN_ZH.md` (an architectural critique of eviction granularity)

The user's directive: **build D→P RDMA snapshot push to skip P-side re-prefill on reseed, then run an experiment showing KVC beats naive PD-disagg.**

---

## 1. What I delivered

### Code

| # | Layer | Key files | Purpose |
|---|---|---|---|
| 1 | mooncake link | `src/agentic_pd_hybrid/snapshot_link.py` | SnapshotPeer wrapper, independent of MooncakeKVManager |
| 2 | SGLang controller | `third_party/sglang/python/sglang/srt/disaggregation/snapshot/controller.py` | Per-worker controller with kv_pool pre-registration |
| 3 | SGLang RPCs | `io_struct.py`, `tokenizer_communicator_mixin.py`, `scheduler.py`, `http_server.py` | 3 RPCs: prepare_receive / dump / finalize_ingest |
| 4 | agentic orchestration | `src/agentic_pd_hybrid/replay.py` | `_attempt_d_to_p_sync` invoked from reseed path |
| 5 | CLI | `cli.py`, `benchmark.py`, `topology.py`, `stack.py` | `--enable-d-to-p-sync`, `--decode-mem-fraction-static`, env injection |
| 6 | smoke tests | `scripts/smoke_snapshot_link*.py`, `scripts/smoke_snapshot_sglang_integration.py` | Phase 1/1b/2 verification |
| 7 | experiments | `scripts/sweep_e4_kvc_v2_d_to_p_sync.sh`, `scripts/sweep_e4_pressured.sh` | E4 sweep configs |
| 8 | analysis | `scripts/analyze_e4_d_to_p.py`, `scripts/analysis/plot_e1_vs_e4.py` | Cross-comparison + figures |

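To make the control flow in rows 3 and 4 concrete, here is a minimal in-memory sketch of a reseed that first tries a D→P snapshot sync and only falls back to re-prefill when the sync fails. Only the three step names (`prepare_receive`, `dump`, `finalize_ingest`) come from the table; the classes, signatures, and the semantics given to each step are assumptions made for illustration and are not the SGLang or mooncake API.

```python
from dataclasses import dataclass, field


@dataclass
class FakePrefillWorker:
    """Hypothetical stand-in for a P worker (not the real SGLang worker)."""
    reserved: dict = field(default_factory=dict)
    ingested: dict = field(default_factory=dict)

    def prepare_receive(self, session_id: str, num_bytes: int) -> bytearray:
        # Assumed semantics: reserve a destination buffer before the transfer.
        buf = bytearray(num_bytes)
        self.reserved[session_id] = buf
        return buf

    def finalize_ingest(self, session_id: str) -> None:
        # Assumed semantics: promote the received bytes into usable state.
        self.ingested[session_id] = bytes(self.reserved.pop(session_id))


@dataclass
class FakeDecodeWorker:
    """Hypothetical stand-in for a D worker holding per-session snapshots."""
    snapshots: dict = field(default_factory=dict)

    def dump(self, session_id: str, dest: bytearray) -> None:
        # Assumed semantics: copy the snapshot into the reserved buffer.
        # A real link would do this over RDMA rather than in memory.
        dest[:] = self.snapshots[session_id]


def attempt_d_to_p_sync(d: FakeDecodeWorker, p: FakePrefillWorker, session_id: str) -> bool:
    """Run the three steps in order; report failure instead of raising."""
    try:
        size = len(d.snapshots[session_id])
        dest = p.prepare_receive(session_id, size)
        d.dump(session_id, dest)
        p.finalize_ingest(session_id)
        return True
    except KeyError:
        # D no longer holds the session: release any reservation made on P.
        p.reserved.pop(session_id, None)
        return False


def reseed(d: FakeDecodeWorker, p: FakePrefillWorker, session_id: str) -> str:
    """Pick the reseed path: snapshot sync if possible, otherwise re-prefill."""
    return "snapshot-sync" if attempt_d_to_p_sync(d, p, session_id) else "re-prefill"
```

Keeping the sync attempt as a boolean-returning step means the caller can always fall back to the pre-existing re-prefill path, which is the behavior the reseed path has without the feature flag.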
### Docs

| Doc | Content |
|---|---|
| `D_TO_P_SYNC_DESIGN_ZH.md` | 446-line design doc with 4 alternatives evaluated, MVP chosen |
| `D_TO_P_PHASE1_LINK_ZH.md` | Phase 1 acceptance: 316 Gbps host, 251 Gbps GPU (both verified end-to-end) |
| `D_TO_P_IMPLEMENTATION_STATUS_ZH.md` | Phase-by-phase audit with known unverified surfaces |
| `E4_PROTOCOL_ZH.md` | Experiment preregistration: H1/H2/H3 + data collection plan |
| `E4_RESULTS_ZH.md` | E4-v1 forensics: 272 admission rejects but 0 D→P fires (entrance gate bug) |
| `E4_VS_E1_RESULTS_ZH.md` | **Headline results**: KVC wins mean/p50/p90, loses p99 (not D→P's fault) |
| `BRANCH_SUMMARY_h200-cu130.md` | This doc |

### Figures (under `docs/figures/`)

- `e1_vs_e4_ttft_pdf.png`: bimodal E4 fast-path peak vs E1 single peak
- `e1_vs_e4_latency_cdf.png`: CDF + log-survival showing crossover at ~p95
- `e4_path_latency.png`: per-execution-mode TTFT breakdown
- `e1_vs_e4_p99_attribution.png`: pie + bar breakdown of E4's p99 tail

---

## 2. Headline numbers

| Metric | E1 naive PD | E4 KVC | Δ |
|---|---:|---:|---:|
| TTFT mean | 90.5s | **58.8s** | **-35%** |
| TTFT p50 | 88.5s | **31.0s** | **-65%** |
| TTFT p90 | 175.2s | 158.9s | -9% |
| TTFT p99 | 207.4s | 224.8s | **+8%** |
| Lat mean | 96.3s | **63.9s** | **-34%** |
| Lat p50 | 93.2s | **37.1s** | **-60%** |
| Lat p99 | 219.5s | 233.8s | +6.5% |
| Success | 93.4% | 87.9% | -5.5pp |
| Wall clock | 88 min | **64 min** | **-27%** |

KVC served 73 requests on the direct-to-D fast path with a mean TTFT of **0.185s**, so the value proposition unique to KVC is realized.
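
The Δ column can be recomputed from the two value columns. The sketch below does that with values copied from the table; the function and variable names are illustrative and do not come from `scripts/analyze_e4_d_to_p.py`.

```python
# Values copied from the headline table: (E1 naive PD, E4 KVC).
# Units are seconds unless the metric name says otherwise.
HEADLINE = {
    "TTFT mean": (90.5, 58.8),
    "TTFT p50": (88.5, 31.0),
    "TTFT p90": (175.2, 158.9),
    "TTFT p99": (207.4, 224.8),
    "Lat mean": (96.3, 63.9),
    "Lat p50": (93.2, 37.1),
    "Lat p99": (219.5, 233.8),
    "Wall clock (min)": (88.0, 64.0),
}


def relative_delta_pct(baseline: float, candidate: float) -> float:
    """Relative change of candidate vs baseline, in percent (negative = lower)."""
    return (candidate - baseline) / baseline * 100.0


def percentage_point_delta(baseline_pct: float, candidate_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return candidate_pct - baseline_pct


DELTAS = {name: relative_delta_pct(e1, e4) for name, (e1, e4) in HEADLINE.items()}
SUCCESS_DELTA_PP = percentage_point_delta(93.4, 87.9)

if __name__ == "__main__":
    for name, delta in DELTAS.items():
        print(f"{name:<18} {delta:+.1f}%")
    print(f"{'Success':<18} {SUCCESS_DELTA_PP:+.1f}pp")
```

Success rate is reported in percentage points rather than as a relative change because both inputs are already percentages.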

---

## 3. The big architectural lesson

Breakdown of E4's p99 tail (n=65 requests with TTFT ≥ 180s):

- **0% direct-to-D** (the fast path never appears in the p99 tail)
- **5% reseed** (the D→P target; only 3 requests)
- **88% fallback chain** (the real culprit, dominated by `large-append-session-cap` at 43%)

Implication: the D→P snapshot, even when fully working, addresses **at most 5% of the p99 tail**. The real p99 cost sits in `_invoke_kvcache_seeded_router` and the various `fallback-real-large-append-*` paths, which involve agentic-side admission RPC retries and seeded-router cold starts, *not* the P re-prefill that D→P was designed to eliminate.

**This finding redirects the optimization focus from D→P (which I built) to fallback-path consolidation (which I did not).**
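
A tail attribution of this kind can be sketched as a filter followed by a grouped count. The record fields `ttft_s` and `execution_mode` and the mode labels below are hypothetical (the metrics schema is not shown in this document), and the sample records are synthetic, not E4 data.

```python
from collections import Counter

TAIL_THRESHOLD_S = 180.0  # TTFT cutoff used for the tail in the breakdown above


def tail_attribution(records, threshold_s=TAIL_THRESHOLD_S):
    """Return {execution_mode: share of tail requests} for TTFT >= threshold_s."""
    tail = [r for r in records if r["ttft_s"] >= threshold_s]
    if not tail:
        return {}
    counts = Counter(r["execution_mode"] for r in tail)
    # most_common() orders modes from largest to smallest share.
    return {mode: n / len(tail) for mode, n in counts.most_common()}


# Synthetic records for illustration only.
SAMPLE = [
    {"ttft_s": 0.2, "execution_mode": "direct-to-d"},
    {"ttft_s": 195.0, "execution_mode": "fallback"},
    {"ttft_s": 210.0, "execution_mode": "fallback"},
    {"ttft_s": 185.0, "execution_mode": "reseed"},
    {"ttft_s": 120.0, "execution_mode": "reseed"},
    {"ttft_s": 230.0, "execution_mode": "fallback"},
]
SHARES = tail_attribution(SAMPLE)

if __name__ == "__main__":
    for mode, share in SHARES.items():
        print(f"{mode:<12} {share:.0%}")
```

The share attributed to a path is an upper bound on how much of the tail an optimization of that path can remove, which is the reasoning behind the "at most 5%" statement.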

---

## 4. What's pending / known issues

- E4-v3 ran with the `--enable-d-to-p-sync` flag, but a CLI plumbing bug meant D→P never actually fired. The fix is in `af966f2`. E4-v4 should validate the path end-to-end (running at time of writing).
- E4's success rate is 5.5pp below E1's (87.9% vs 93.4%). Failures are concentrated in agentic-side timeouts on `pd-router-real-large-append` paths. This is not a D→P issue.
- The active mode of the D→P snapshot (push at append completion, as opposed to the current passive mode triggered on reseed) was not built. Per design doc §2.5, this could be the next phase.
- `pd-router-fallback-real-large-append-session-cap` (43% of the p99 tail) is the highest-leverage target for future optimization.

---

## 5. Commits (newest first)

```
e9ad1c4 feat(experiments): E4 vs E1 results + p99 attribution figures
af966f2 fix(cli): plumb --enable-d-to-p-sync through benchmark-live → ReplayConfig
f6d6dc0 feat(cli): per-role --mem-fraction-static + use in E4-pressured
fbeb968 feat(experiments): E4-pressured sweep — force reseed via reject_threshold=1
e729d62 fix(d2p): structural log + relax entrance condition for sync
1d68ad6 docs(experiments): E4 results — initial scaffold + mid-run observation
9149b53 feat(experiments): E4 cross-comparison analysis helper
a4f30e6 docs(d2p): implementation status snapshot — Phase 1-3 audit
8a2f72f feat(experiments): E4 protocol + sweep script — KVC + D→P vs naive PD
b9b0cf0 feat(agentic): D→P snapshot orchestration in reseed path + CLI flag
a369722 fix(sglang): account snapshot-reserved slots in radix mem leak check
86412bb feat(sglang): D→P snapshot link integration — controller + RPC handlers
7216507 feat(snapshot): D→P RDMA Phase 1b — GPU pointer path verified
dc4867c feat(snapshot): D→P RDMA link Phase 1 — minimal byte transport
9c35edd docs(design): D→P RDMA snapshot push design
6d1c923 docs(architecture): KVC eviction granularity is the wrong abstraction
986f351 feat(sglang): drop streaming-session reqs with fill_ids < prefix_indices
d40db1f docs(experiments): E3 first run — load-floor bonus works, exposes SGLang bug
a1abdcd feat(experiments): E3 sweep — KVC v2 + RDMA + load-floor bonus
93fce42 feat(policy): load-floor bonus for KvAwarePolicy (Q2.B)
905d671 feat(env): MC_TRANSFER_TIMEOUT=1800s default in setup_env + stack
9a166ac docs(experiments): design space for Q1 (mooncake stall) + Q2 (cold-D)
... (predecessor work)
```

---

## 6. How to reproduce

```bash
# Env setup
source scripts/setup_env.sh

# Pre-existing baseline (E1)
bash scripts/sweep_e1_naive_1p3d.sh

# KVC + load-floor + D→P (E4-pressured)
bash scripts/sweep_e4_pressured.sh

# Cross-comparison + figures
uv run --no-sync python scripts/analysis/plot_e1_vs_e4.py \
  --e1-metrics outputs/e1_naive_1p3d_kvaware_rdma_50sess/e1_naive_1p3d_kvaware_run1_metrics.jsonl \
  --e4-metrics outputs/e4p_kvc_v2_d_to_p_sync_pressured_50sess/e4p_kvc_v2_d_to_p_sync_run1_metrics.jsonl
```
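
For a quick sanity check of a metrics file without the plotting script, a summary along the following lines may be enough. It is a sketch under assumptions: the per-request field name `ttft_s` is hypothetical (the JSONL schema is not shown here), and the nearest-rank percentile definition is a choice made for this example, not necessarily the one used by `plot_e1_vs_e4.py`.

```python
import json
import math


def nearest_rank_percentile(sorted_values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize_ttft(jsonl_text, key="ttft_s"):
    """Summarize one numeric field from JSONL text (one JSON object per line)."""
    values = sorted(
        json.loads(line)[key]
        for line in jsonl_text.splitlines()
        if line.strip()  # skip blank lines
    )
    if not values:
        raise ValueError("no records found")
    return {
        "count": len(values),
        "mean": sum(values) / len(values),
        "p50": nearest_rank_percentile(values, 50),
        "p90": nearest_rank_percentile(values, 90),
        "p99": nearest_rank_percentile(values, 99),
    }


# Synthetic input for illustration only: TTFT values 1.0 .. 100.0 seconds.
SAMPLE_JSONL = "\n".join(json.dumps({"ttft_s": float(v)}) for v in range(1, 101))
SUMMARY = summarize_ttft(SAMPLE_JSONL)

if __name__ == "__main__":
    for name, value in SUMMARY.items():
        print(f"{name:<6} {value}")
```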
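
---

**Key takeaway**: The D→P RDMA link is deployed across the full stack and verified by the link smoke tests. The E4 data show that KVC beats naive PD-disagg on TTFT mean/p50/p90 (by 35%, 65%, and 9% respectively). The p99 tail attribution shows that D→P is not on the p99 critical path, so the next optimization phase should turn to the fallback chain.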