agentic-pd-hybrid

gahow/agentic-pd-hybrid

Commit Graph

Author	SHA1	Message	Date

kzlin

6e5ed8da80

feat(kvc): Option D - delegate seed/reseed admission to D worker

v4 (cap=16) saw 35% session-cap fallback because the local soft_cap,
`min(16, usable / target)`, evaluates to 1-2 for large agentic inputs.
The cap was hit not because D was full but because replay's heuristic
underestimated D's capacity.
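
To illustrate why the soft cap collapses, here is a minimal sketch of the arithmetic. The function name, the pool size, and the per-session targets are hypothetical and chosen only to show the effect; integer division is assumed, and real values depend on the deployment.

```python
def local_soft_cap(usable_tokens: int, target_tokens_per_session: int,
                   hard_cap: int = 16) -> int:
    """Soft cap in the shape described above: min(hard_cap, usable / target)."""
    return min(hard_cap, usable_tokens // target_tokens_per_session)


# Hypothetical numbers, for illustration only.
usable = 200_000                          # KV tokens replay believes D can hold
small = local_soft_cap(usable, 8_000)     # short sessions: the hard cap binds
large = local_soft_cap(usable, 120_000)   # large agentic input: the estimate binds
print(small, large)
```

With a large per-session target the quotient, not the hard cap of 16, becomes the limit, so sessions fall back even when D still has room.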

This change makes worker admission_mode authoritative for ALL paths:

SGLang side:
- io_struct.py: DirectAppendAdmissionReqInput gains a `mode` field
  ("direct_append" | "seed", default "direct_append" preserves prior
  behavior).
- scheduler.py:admit_direct_append: when mode == "seed", skip the
  resident-on-D requirement and run the same capacity check + LRU
  eviction (maybe_trim_decode_session_cache) that direct_append uses.
  This lets D atomically decide if a new session can be admitted based
  on actual token_to_kv_pool_allocator state.
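
The worker-side behavior described above can be sketched with a toy model. This is not the SGLang implementation: `ToyDecodeWorker`, `session_id`, `required_tokens`, and the token-count bookkeeping are assumptions made for illustration. Only the `mode` field, its two values, and its default come from the description above.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class DirectAppendAdmissionReqInput:
    session_id: str               # hypothetical field
    required_tokens: int          # hypothetical field
    mode: str = "direct_append"   # "direct_append" | "seed"


class ToyDecodeWorker:
    """Toy stand-in for D: a token budget plus an LRU map of resident sessions."""

    def __init__(self, pool_tokens: int):
        self.pool_tokens = pool_tokens
        self.sessions = OrderedDict()  # session_id -> tokens held, oldest first

    def free_tokens(self) -> int:
        return self.pool_tokens - sum(self.sessions.values())

    def admit_direct_append(self, req: DirectAppendAdmissionReqInput) -> bool:
        resident = req.session_id in self.sessions
        if req.mode == "direct_append" and not resident:
            return False  # direct_append requires the session to already be on D
        # Capacity check: can the request fit if every other session is evicted?
        evictable = sum(t for s, t in self.sessions.items() if s != req.session_id)
        if self.free_tokens() + evictable < req.required_tokens:
            return False
        # LRU eviction: drop the oldest other sessions until the request fits.
        while self.free_tokens() < req.required_tokens:
            victim = next(s for s in self.sessions if s != req.session_id)
            del self.sessions[victim]
        held = self.sessions.get(req.session_id, 0)
        self.sessions[req.session_id] = held + req.required_tokens
        self.sessions.move_to_end(req.session_id)  # mark as most recently used
        return True


d = ToyDecodeWorker(pool_tokens=100)
first = d.admit_direct_append(DirectAppendAdmissionReqInput("a", 60, mode="seed"))
second = d.admit_direct_append(DirectAppendAdmissionReqInput("b", 30))
third = d.admit_direct_append(DirectAppendAdmissionReqInput("b", 60, mode="seed"))
```

In this toy, the default-mode request for the non-resident session "b" is rejected, while the same session is admitted with mode="seed" after "a" is evicted. Check, eviction, and admission happen inside one call, which is the property the commit relies on.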

Replay side (replay.py):
- _query_decode_direct_admission gains a `mode` parameter.
- _reserve_decode_session_capacity: in worker admission_mode, the
  seed/reseed branch now queries D with mode="seed" and trusts the
  result, instead of estimating capacity from the residency snapshot.
- _should_admit_new_decode_session: in worker mode, skip the local
  soft_cap pre-check and let D decide. Same-D session fast-path is
  preserved.
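
A minimal sketch of the replay-side decision follows. The signature, the "worker" string value, and the meaning of the fast path (a session already on the same D is admitted without a query) are assumptions; the actual replay.py code may differ.

```python
from typing import Callable


def should_admit_new_decode_session(
    admission_mode: str,
    active_sessions: int,
    soft_cap: int,
    already_on_same_d: bool,
    query_worker: Callable[[str], bool],
) -> bool:
    """Decide whether replay should place a new session on D."""
    if already_on_same_d:
        return True  # same-D fast path, kept in both modes
    if admission_mode == "worker":
        # Skip the local soft_cap pre-check; D answers from its real pool state.
        return query_worker("seed")
    # Any other mode: fall back to the local heuristic.
    return active_sessions < soft_cap


# Hypothetical callers: a worker that accepts seeds, and a full local cap.
worker_says_yes = should_admit_new_decode_session(
    "worker", 16, 1, False, lambda mode: mode == "seed")
local_says_no = should_admit_new_decode_session(
    "local", 16, 1, False, lambda mode: True)
```

In worker mode the session count and soft cap are ignored entirely, so a local estimate of 1-2 no longer blocks admission.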

Effects:
- Local hardcoded cap of 16 is bypassed under worker mode; D's real
  KV pool size is the only constraint.
- LRU eviction runs in D's process atomically with admission, so
  starvation (the v3 bimodal "lucky vs starved sessions" pattern)
  should resolve.

scripts/sweep_tp1_v5_optD.sh added to run the same 1P7D / 2P6D
configs as v4 with the new admission path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-28 23:40:03 +08:00

1 Commit