docs: document sglang maintenance workflow

2026-04-24 12:31:32 +00:00
parent b8e6f13c20
commit 5bdc0ed4f0
2 changed files with 30 additions and 8 deletions


@@ -35,8 +35,10 @@ Sync the environment:
uv sync
```
-Local experiments can use a repo-local `third_party/sglang` checkout of SGLang
-`v0.5.10`, but that heavyweight checkout is intentionally not committed here.
+`third_party/sglang` vendors a clean SGLang `v0.5.10` snapshot plus our local
+PD/session-cache patches in later commits. Keep SGLang changes scoped under that
+directory and commit them with `feat(sglang): ...` or `fix(sglang): ...` so they
+stay easy to review against the vendor baseline.
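
The scoping rule above could be checked mechanically before review. The following is only a sketch under assumptions: `is_scoped_sglang_commit` is a hypothetical helper that does not exist in this repository, and it assumes the commit subject and the list of changed paths have already been collected (for example from `git show --name-only`).

```python
import re
from pathlib import PurePosixPath

# Assumed convention, taken from the paragraph above: subjects start with
# "feat(sglang): " or "fix(sglang): " followed by a description.
SGLANG_SUBJECT = re.compile(r"^(feat|fix)\(sglang\): \S")
VENDOR_ROOT = PurePosixPath("third_party/sglang")


def is_scoped_sglang_commit(subject: str, changed_paths: list[str]) -> bool:
    """Hypothetical check: subject follows the convention and every changed
    path lives under the vendored SGLang directory."""
    if not SGLANG_SUBJECT.match(subject):
        return False
    # An empty change list is rejected; each path must have VENDOR_ROOT
    # among its parent directories.
    return bool(changed_paths) and all(
        VENDOR_ROOT in PurePosixPath(path).parents for path in changed_paths
    )
```

A commit that also touches files outside `third_party/sglang` is rejected by this sketch, which matches the goal of keeping local patches easy to compare with the vendor baseline.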
## CLI
@@ -126,9 +128,9 @@ Notes:
- Live benchmarking uses the repo-local `agentic_pd_hybrid.pd_router`, which
preserves the real prefill/decode double-request path over loopback without
depending on the upstream Rust router build.
-- Managed live benchmarking prefers a local
-`third_party/sglang/python/sglang` checkout when it exists, so local SGLang
-source changes can apply immediately without packaging a wheel.
+- Managed live benchmarking prefers the vendored
+`third_party/sglang/python/sglang` source tree, so local SGLang changes apply
+immediately without packaging a wheel.
- Live benchmarking currently targets the `mooncake` transfer backend, because
`mooncake-transfer-engine` is installed and usable on this node.
- `benchmark-live` and `replay` support streaming by default for TTFT/TPOT


@@ -5,9 +5,11 @@ session-aware and KV-cache-aware prefill/decode routing can improve end-to-end
latency for agentic coding workloads on top of SGLang xPyD.
The current target environment is a single 8-GPU node running SGLang `v0.5.10`
-with Qwen3-Coder-30B-A3B-Instruct. The local setup keeps the P -> D transfer
-path through SGLang disaggregation and Mooncake loopback instead of replacing it
-with an in-process shortcut.
+with Qwen3-Coder-30B-A3B-Instruct. The repo vendors SGLang under
+`third_party/sglang` so our xPyD/session-cache changes are maintained together
+with the benchmark harness. The local setup keeps the P -> D transfer path
+through SGLang disaggregation and Mooncake loopback instead of replacing it with
+an in-process shortcut.
## Design
@@ -57,6 +59,24 @@ The prototype currently includes:
disaggregation wait timeout to avoid treating transfer hangs as successful
long-tail responses.
+## SGLang Maintenance
+SGLang is tracked directly in this repository:
+- `chore: vendor sglang v0.5.10 snapshot` records the clean upstream baseline.
+- Later `feat(sglang): ...` / `fix(sglang): ...` commits should contain only
+local SGLang changes.
+- Generated files such as `__pycache__` and benchmark outputs stay ignored.
+The current SGLang patch adds the worker-side mechanisms needed by
+KV-cache-centric experiments:
+- decode workers can optionally accept local append-prefill requests in PD mode;
+- streaming session cache status is exposed for router/admission decisions;
+- idle streaming sessions can be evicted at session granularity;
+- direct append admission can check resident session state and D token pressure
+before the replay path bypasses P.
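
The last item in the list above, direct append admission, can be illustrated with a small sketch. Everything named here is an assumption for illustration: `DecodeWorkerStatus`, `admit_direct_append`, and the `max_pressure` threshold of 0.9 are hypothetical and are not SGLang or repository APIs. The sketch only shows the shape of the decision: skip P when the session is resident on D and the appended tokens would not push D past a token-pressure limit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeWorkerStatus:
    """Hypothetical snapshot of a decode (D) worker; not an SGLang type."""

    resident_sessions: frozenset  # session ids whose KV cache is still on D
    used_tokens: int              # tokens currently held by D
    token_capacity: int           # total token budget of D


def admit_direct_append(
    status: DecodeWorkerStatus,
    session_id: str,
    append_tokens: int,
    max_pressure: float = 0.9,  # assumed threshold, chosen for illustration
) -> bool:
    """Return True when the append may bypass P and run on D directly."""
    # Without resident session state there is no cached prefix to append to.
    if session_id not in status.resident_sessions:
        return False
    if status.token_capacity <= 0:
        return False
    # Projected token pressure on D after accepting the appended tokens.
    projected = (status.used_tokens + append_tokens) / status.token_capacity
    return projected <= max_pressure
```

When the check fails, the request would fall back to the regular P -> D path rather than being rejected.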
## Current Findings
The micro-benchmark can make KV-cache-centric routing look better than