docs: document sglang maintenance workflow

2026-04-24 12:31:32 +00:00
parent b8e6f13c20
commit 5bdc0ed4f0
2 changed files with 30 additions and 8 deletions


@@ -5,9 +5,11 @@ session-aware and KV-cache-aware prefill/decode routing can improve end-to-end
latency for agentic coding workloads on top of SGLang xPyD.
The current target environment is a single 8-GPU node running SGLang `v0.5.10`
-with Qwen3-Coder-30B-A3B-Instruct. The local setup keeps the P -> D transfer
-path through SGLang disaggregation and Mooncake loopback instead of replacing it
-with an in-process shortcut.
+with Qwen3-Coder-30B-A3B-Instruct. The repo vendors SGLang under
+`third_party/sglang` so our xPyD/session-cache changes are maintained together
+with the benchmark harness. The local setup keeps the P -> D transfer path
+through SGLang disaggregation and Mooncake loopback instead of replacing it with
+an in-process shortcut.
## Design
@@ -57,6 +59,24 @@ The prototype currently includes:
disaggregation wait timeout to avoid treating transfer hangs as successful
long-tail responses.
## SGLang Maintenance
SGLang is tracked directly in this repository:
- `chore: vendor sglang v0.5.10 snapshot` records the clean upstream baseline.
- Later `feat(sglang): ...` / `fix(sglang): ...` commits should contain only
local SGLang changes.
- Generated files such as `__pycache__` and benchmark outputs stay ignored.
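
The commit convention above could be checked mechanically. The following is a minimal sketch, not something the repository ships: the function name `classify_sglang_commit` and both regular expressions are assumptions derived only from the subject lines quoted in the list.

```python
import re

# Assumed patterns, derived from the subject lines quoted in the list above.
VENDOR_RE = re.compile(r"^chore: vendor sglang v\d+\.\d+\.\d+ snapshot$")
LOCAL_RE = re.compile(r"^(feat|fix)\(sglang\): \S.*$")


def classify_sglang_commit(subject: str) -> str:
    """Classify a commit subject as 'baseline', 'local-change', or 'other'.

    'baseline'     -> a clean upstream snapshot commit
    'local-change' -> a feat(sglang)/fix(sglang) commit with local changes
    'other'        -> anything else (docs, benchmark harness, ...)
    """
    subject = subject.strip()
    if VENDOR_RE.match(subject):
        return "baseline"
    if LOCAL_RE.match(subject):
        return "local-change"
    return "other"
```

A pre-commit or CI step could run such a check on commits that touch `third_party/sglang` and flag any subject classified as `other`.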
The current SGLang patch adds the worker-side mechanisms needed by
KV-cache-centric experiments:
- decode workers can optionally accept local append-prefill requests in PD mode;
- streaming session cache status is exposed for router/admission decisions;
- idle streaming sessions can be evicted at session granularity;
- direct append admission can check resident session state and D token pressure
before the replay path bypasses P.
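
The last mechanism can be illustrated with a small admission check. This is a sketch under stated assumptions, not the patch itself: `DecodeWorkerStatus`, `can_direct_append`, the field names, and the `max_pressure` threshold of 0.9 are all hypothetical and do not correspond to SGLang APIs.

```python
from dataclasses import dataclass


@dataclass
class DecodeWorkerStatus:
    """Hypothetical snapshot of what a decode (D) worker reports."""
    resident_sessions: set[str]  # session ids whose KV cache is still on D
    used_tokens: int             # tokens currently held in D's KV cache
    capacity_tokens: int         # total token capacity of D's KV cache


def can_direct_append(
    status: DecodeWorkerStatus,
    session_id: str,
    append_tokens: int,
    max_pressure: float = 0.9,   # assumed threshold, not taken from the patch
) -> bool:
    """Decide whether an append-prefill may run on D and bypass P.

    Admission requires both conditions named above:
    1. the session's cache is still resident on the decode worker;
    2. D's projected token usage stays at or below the pressure limit.
    """
    if session_id not in status.resident_sessions:
        return False  # session was evicted or never seen: go through P
    if status.capacity_tokens <= 0:
        return False  # no usable capacity reported
    projected = (status.used_tokens + append_tokens) / status.capacity_tokens
    return projected <= max_pressure
```

When the check returns `False`, the request would fall back to the regular P -> D path, so a wrong admission decision costs latency rather than correctness.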
## Current Findings
The micro-benchmark can make KV-cache-centric routing look better than