Commit Graph

14 Commits

Author SHA1 Message Date
kzlin
6e5ed8da80 feat(kvc): Option D - delegate seed/reseed admission to D worker
v4 (cap=16) saw 35% session-cap fallback because the local soft_cap
min(16, usable / target) evaluates to 1-2 for large agentic inputs.
The cap was hit not because D was full but because replay's heuristic
underestimated capacity.
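
A minimal sketch of why the capacity-based cap collapses for large inputs. The function name, the floor division, the floor of 1, and the token counts are illustrative assumptions; only the `min(16, usable / target)` shape comes from the message above.

```python
def decode_session_soft_cap(usable_tokens: int, target_tokens: int,
                            hard_cap: int = 16) -> int:
    """Capacity-based session cap: min(hard_cap, usable // target).

    The floor of 1 is an assumption so that one session can always run.
    """
    if target_tokens <= 0:
        return hard_cap
    return max(1, min(hard_cap, usable_tokens // target_tokens))


# Hypothetical numbers: a D worker with 200k usable KV tokens.
small_input_cap = decode_session_soft_cap(200_000, 8_000)    # hard cap wins
large_input_cap = decode_session_soft_cap(200_000, 120_000)  # capacity wins
print(small_input_cap, large_input_cap)
```

With a large per-session target the quotient, not the hard cap of 16, is the binding term, so raising the hard cap alone cannot help.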

This change makes worker admission_mode authoritative for ALL paths:

SGLang side:
- io_struct.py: DirectAppendAdmissionReqInput gains a `mode` field
  ("direct_append" | "seed", default "direct_append" preserves prior
  behavior).
- scheduler.py:admit_direct_append: when mode == "seed", skip the
  resident-on-D requirement and run the same capacity check + LRU
  eviction (maybe_trim_decode_session_cache) that direct_append uses.
  This lets D atomically decide if a new session can be admitted based
  on actual token_to_kv_pool_allocator state.
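
The D-side behavior described above can be sketched with a toy model. This is not the SGLang scheduler: `ToyDecodeWorker` and its methods are hypothetical, and only the two `mode` values and the idea of a capacity check followed by LRU eviction are taken from the message.

```python
from collections import OrderedDict


class ToyDecodeWorker:
    """Toy stand-in for a D worker's KV pool (hypothetical, not SGLang code)."""

    def __init__(self, capacity_tokens: int):
        self.capacity = capacity_tokens
        # session_id -> resident tokens, least recently used first
        self.sessions = OrderedDict()

    def free_tokens(self) -> int:
        return self.capacity - sum(self.sessions.values())

    def admit(self, session_id: str, new_tokens: int,
              mode: str = "direct_append") -> bool:
        if mode not in ("direct_append", "seed"):
            raise ValueError(f"unknown mode: {mode}")
        # direct_append requires a resident session; seed skips that check.
        if mode == "direct_append" and session_id not in self.sessions:
            return False
        # Tokens reclaimable by evicting every other session.
        evictable = sum(t for s, t in self.sessions.items() if s != session_id)
        if self.free_tokens() + evictable < new_tokens:
            return False  # would not fit even after eviction; evict nothing
        # Evict least recently used sessions until the request fits.
        while self.free_tokens() < new_tokens:
            victim = next(s for s in self.sessions if s != session_id)
            del self.sessions[victim]
        self.sessions[session_id] = self.sessions.get(session_id, 0) + new_tokens
        self.sessions.move_to_end(session_id)  # mark as most recently used
        return True


worker = ToyDecodeWorker(capacity_tokens=100)
print(worker.admit("a", 60, mode="seed"))   # new session, fits
print(worker.admit("b", 30))                # direct_append, not resident
print(worker.admit("b", 60, mode="seed"))   # seed, evicts "a" to make room
```

Because the check and the eviction happen in one call on the worker, the decision reflects the pool's actual state rather than a remote snapshot.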

Replay side (replay.py):
- _query_decode_direct_admission gains a `mode` parameter.
- _reserve_decode_session_capacity: in worker admission_mode, the
  seed/reseed branch now queries D with mode="seed" and trusts the
  result, instead of estimating capacity from the residency snapshot.
- _should_admit_new_decode_session: in worker mode, skip the local
  soft_cap pre-check and let D decide. Same-D session fast-path is
  preserved.
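
A rough sketch of the replay-side decision order, under stated assumptions: the function below is not the code in replay.py, the `"worker"` / `"local"` mode strings and all parameter names are hypothetical, and `query_worker` stands in for the call that asks D with mode="seed".

```python
from typing import Callable


def should_admit_new_decode_session(
    admission_mode: str,
    session_already_on_worker: bool,
    open_sessions: int,
    local_soft_cap: int,
    query_worker: Callable[[str], bool],
) -> bool:
    """Decide whether a new decode session may be placed on a D worker."""
    # Fast path: the session already lives on this D worker.
    if session_already_on_worker:
        return True
    # Worker mode: skip the local estimate and let D decide.
    if admission_mode == "worker":
        return query_worker("seed")
    # Local mode: fall back to the soft-cap heuristic.
    return open_sessions < local_soft_cap


# Hypothetical D worker that always has room.
def always_yes(mode: str) -> bool:
    return True


# The local cap of 2 is already reached, but worker mode ignores it.
print(should_admit_new_decode_session("worker", False, 2, 2, always_yes))
print(should_admit_new_decode_session("local", False, 2, 2, always_yes))
```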

Effects:
- Local hardcoded cap of 16 is bypassed under worker mode; D's real
  KV pool size is the only constraint.
- LRU eviction runs in D's process atomically with admission, so
  starvation (the v3 bimodal "lucky vs starved sessions" pattern)
  should resolve.

scripts/sweep_tp1_v5_optD.sh added to run the same 1P7D / 2P6D
configs as v4 with the new admission path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:40:03 +08:00
kzlin
74194e660a docs: v4 final results, error analysis, and updated journey
Add v4 sweep results and post-mortem analysis showing:

- direct-to-D path: 54.3% (1P7D) / 58.0% (2P6D) of requests now use
  KVC cleanly. P50=0.5s and TTFT P50=0.043s; this path beats baseline
  8DP across the board (P50 -24%, TTFT P50 -54%, TTFT P90 -79%).

- Overall vs baseline (errors+truncated excluded):
  v4 2P6D P50=0.85s vs baseline 0.66s (28% slower).
  Reason is not errors -- 35% of requests still hit
  fallback-large-append-session-cap, where capacity-based
  cap = usable_tokens / target_tokens evaluates to 1-2 (not 16)
  for large agentic inputs.

- 9-10% errors on KVC variants are mooncake TCP transfer timeouts,
  not SGLang logic bugs. Prefill log shows
  "Failed to send kv chunk ... 32s timeout ... session not alive".
  Errors concentrate in turns >= 31 (large inputs), once the run is
  more than 44.8% complete.
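
The P50 percentages above can be reproduced with a short calculation. This sketch assumes the 0.66s baseline P50 applies to both comparisons; the helper name is illustrative.

```python
def relative_change(value: float, baseline: float) -> float:
    """Signed change of `value` relative to `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


BASELINE_P50 = 0.66  # seconds, baseline 8DP P50 quoted above

print(f"direct-to-D P50:     {relative_change(0.5, BASELINE_P50):+.1f}%")
print(f"v4 2P6D overall P50: {relative_change(0.85, BASELINE_P50):+.1f}%")
```

The first figure comes out near -24% and the second near +28.8%, consistent with the rounded values in the message.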

Track:
- docs/KVC_DEBUG_JOURNEY_V1_TO_V4.md: append v4 results table,
  per-mode breakdown, and error root cause.
- scripts/analysis/{analyze_v3,analyze_v4,analyze_errors,compare_no_error}.py
- outputs/qwen3-30b-tp1-v{3,4}*/exp*_summary.json (force-added,
  small JSON; metrics.jsonl excluded due to size).
- outputs/qwen3-30b-tp1-v{3,4}*/sweep_results.txt

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:34:01 +08:00
kzlin
c9d350b372 docs: KVC v1-v4 debug journey + raise session soft_cap to 16
Document the iterative debugging from v1 (broken KVC) through v4
(routing fixed + session cap raised), with code-level analysis of
the two main bugs encountered:

1. v2 root cause (mis-diagnosed previously as `allow_local_prefill`):
   `--policy default` for KVC mechanism caused replay's round-robin
   policy and the PD router's round-robin to diverge, sending requests
   with `session_params` to a D worker that did not have the session
   open. Resulted in 56-61% truncation with finish_reason
   "session id X does not exist".
   Fix: use `--policy kv-aware` (sweep_tp1_v3_kvaware.sh) so replay
   emits `x-smg-target-worker` and PD router uses consistent_hashing.

2. v3 new bottleneck: `pd-router-fallback-large-append-session-cap`
   accounted for 52-65% of requests. Root cause was hardcoded
   `min(4, ...)` in `_decode_session_soft_cap`. With 7 D workers x 4
   sessions = 28 slots for 52 trace sessions, ~24 sessions starved
   permanently (bimodal direct-to-D rate of 0% or 99%).
   Fix: raise the cap to 16 (replay.py).
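
The routing divergence in item 1 can be illustrated with a toy simulation. Everything here is an assumption for illustration: the worker names, the turn sequence, and the plain modulo hash, which only stands in for the router's consistent hashing and is not its actual algorithm.

```python
import hashlib
from itertools import cycle

WORKERS = [f"d{i}" for i in range(7)]  # hypothetical D worker names


def hashed_worker(session_id: str) -> str:
    """Deterministic session -> worker mapping (plain modulo hash)."""
    digest = hashlib.sha256(session_id.encode()).digest()
    return WORKERS[int.from_bytes(digest[:8], "big") % len(WORKERS)]


def misroutes_round_robin(turns: list[str]) -> int:
    """Replay opens sessions with one counter; the router routes with another."""
    replay_rr, router_rr = cycle(WORKERS), cycle(WORKERS)
    opened: dict[str, str] = {}
    misses = 0
    for session_id in turns:
        if session_id not in opened:
            opened[session_id] = next(replay_rr)  # advances once per session
        if next(router_rr) != opened[session_id]:  # advances once per request
            misses += 1
    return misses


def misroutes_hashed(turns: list[str]) -> int:
    """Both sides derive the worker from the session id, so they agree."""
    opened: dict[str, str] = {}
    misses = 0
    for session_id in turns:
        opened.setdefault(session_id, hashed_worker(session_id))
        if hashed_worker(session_id) != opened[session_id]:
            misses += 1
    return misses


turns = ["s0", "s1", "s0", "s2", "s1", "s0"]  # hypothetical multi-turn trace
print(misroutes_round_robin(turns), misroutes_hashed(turns))
```

In this toy trace the two round-robin counters drift apart as soon as a session sends a second turn, while a mapping keyed on the session id stays aligned by construction.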

Also includes the v3 finding that direct-to-d-session path P50=0.495s
and TTFT P50=0.043s already beats the 8-way DP baseline (0.65s/0.093s);
the KVC core mechanism works when fallback paths are avoided.

Files:
- docs/KVC_DEBUG_JOURNEY_V1_TO_V4.md: full journey + code location index
- docs/SWEBENCH_EXPERIMENT_{PROGRESS,RESULTS}.md: prior session notes
- scripts/sweep_tp1_v{2,3,4}*.sh: experiment driver scripts
- src/agentic_pd_hybrid/replay.py: cap 4 -> 16, audit fields
- src/agentic_pd_hybrid/pd_router.py: strip session_params from prefill
- src/agentic_pd_hybrid/metrics.py: truncated_request_count

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 21:10:41 +08:00
e9062b1d6e Document PD baseline comparison 2026-04-25 17:29:27 +00:00
c928c7db23 Add transfer queue admission knobs 2026-04-25 17:29:15 +00:00
fe583fb413 Document kvcache-centric experiment progress 2026-04-25 16:01:31 +00:00
13bb31a446 Add kvcache-centric profiling and admission controls 2026-04-25 16:00:52 +00:00
08b13d22bc docs: rewrite project docs in concise Chinese 2026-04-24 12:41:52 +00:00
5bdc0ed4f0 docs: document sglang maintenance workflow 2026-04-24 12:31:32 +00:00
b8e6f13c20 feat(sglang): support decode session cache admission 2026-04-24 12:30:41 +00:00
bded08301f chore: vendor sglang v0.5.10 snapshot 2026-04-24 12:29:36 +00:00
78f0d15221 docs: document project design and status 2026-04-24 12:17:55 +00:00
4bca741f32 feat: add agentic pd hybrid benchmark prototype 2026-04-24 12:17:46 +00:00
d2fe014db7 chore: initialize repo hygiene 2026-04-24 12:17:40 +00:00