docs: document sglang maintenance workflow
README.md (12 changed lines)
@@ -35,8 +35,10 @@ Sync the environment:
 uv sync
 ```
 
-Local experiments can use a repo-local `third_party/sglang` checkout of SGLang
-`v0.5.10`, but that heavyweight checkout is intentionally not committed here.
+`third_party/sglang` vendors a clean SGLang `v0.5.10` snapshot plus our local
+PD/session-cache patches in later commits. Keep SGLang changes scoped under that
+directory and commit them with `feat(sglang): ...` or `fix(sglang): ...` so they
+stay easy to review against the vendor baseline.
 
 ## CLI
 
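The scoping rule in the hunk above (SGLang changes live under `third_party/sglang` and are committed as `feat(sglang): ...` or `fix(sglang): ...`) could be checked mechanically. The following is a minimal sketch under that reading; `check_sglang_commit` and its arguments are hypothetical and not part of this repository.

```python
import re

# Hypothetical helper, not part of the repository. It encodes one reading of the
# convention: a `feat(sglang)` / `fix(sglang)` commit should only touch files
# under third_party/sglang.
SGLANG_DIR = "third_party/sglang/"
SGLANG_SUBJECT = re.compile(r"^(feat|fix)\(sglang\): \S")


def check_sglang_commit(subject: str, changed_paths: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the commit follows the rule."""
    if not SGLANG_SUBJECT.match(subject):
        # Commits with any other subject are outside the scope of this check.
        return []
    outside = [p for p in changed_paths if not p.startswith(SGLANG_DIR)]
    if outside:
        return [f"sglang-scoped commit touches files outside {SGLANG_DIR}: {outside}"]
    return []
```

A pre-commit hook or CI step could feed it the commit subject and the output of `git diff --name-only`; an empty result means the commit stays reviewable against the vendor baseline.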
@@ -126,9 +128,9 @@ Notes:
 - Live benchmarking uses the repo-local `agentic_pd_hybrid.pd_router`, which
 preserves the real prefill/decode double-request path over loopback without
 depending on the upstream Rust router build.
-- Managed live benchmarking prefers a local
-`third_party/sglang/python/sglang` checkout when it exists, so local SGLang
-source changes can apply immediately without packaging a wheel.
+- Managed live benchmarking prefers the vendored
+`third_party/sglang/python/sglang` source tree, so local SGLang changes apply
+immediately without packaging a wheel.
 - Live benchmarking currently targets the `mooncake` transfer backend, because
 `mooncake-transfer-engine` is installed and usable on this node.
 - `benchmark-live` and `replay` support streaming by default for TTFT/TPOT
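The note about preferring the vendored `third_party/sglang/python/sglang` source tree can be illustrated as follows. The diff does not show how the harness selects the source tree, so this is only a sketch that assumes the preference is expressed through `PYTHONPATH` for the launched worker processes; `sglang_env` is a hypothetical name.

```python
import os
from collections.abc import Mapping
from pathlib import Path


def sglang_env(repo_root: Path, base_env: Mapping[str, str]) -> dict[str, str]:
    """Copy base_env and, if the vendored tree exists, put it first on PYTHONPATH."""
    env = dict(base_env)
    vendored_pkg = repo_root / "third_party" / "sglang" / "python" / "sglang"
    if vendored_pkg.is_dir():
        # The importable root is the parent of the `sglang` package directory.
        source_root = str(vendored_pkg.parent)
        existing = env.get("PYTHONPATH", "")
        env["PYTHONPATH"] = source_root + (os.pathsep + existing if existing else "")
    return env
```

Passing `env=sglang_env(repo_root, os.environ)` to `subprocess.Popen` would make a child process import the vendored sources ahead of any installed wheel, which is why edits take effect without repackaging.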
@@ -5,9 +5,11 @@ session-aware and KV-cache-aware prefill/decode routing can improve end-to-end
 latency for agentic coding workloads on top of SGLang xPyD.
 
 The current target environment is a single 8-GPU node running SGLang `v0.5.10`
-with Qwen3-Coder-30B-A3B-Instruct. The local setup keeps the P -> D transfer
-path through SGLang disaggregation and Mooncake loopback instead of replacing it
-with an in-process shortcut.
+with Qwen3-Coder-30B-A3B-Instruct. The repo vendors SGLang under
+`third_party/sglang` so our xPyD/session-cache changes are maintained together
+with the benchmark harness. The local setup keeps the P -> D transfer path
+through SGLang disaggregation and Mooncake loopback instead of replacing it with
+an in-process shortcut.
 
 ## Design
 
@@ -57,6 +59,24 @@ The prototype currently includes:
 disaggregation wait timeout to avoid treating transfer hangs as successful
 long-tail responses.
 
+## SGLang Maintenance
+
+SGLang is tracked directly in this repository:
+
+- `chore: vendor sglang v0.5.10 snapshot` records the clean upstream baseline.
+- Later `feat(sglang): ...` / `fix(sglang): ...` commits should contain only
+local SGLang changes.
+- Generated files such as `__pycache__` and benchmark outputs stay ignored.
+
+The current SGLang patch adds the worker-side mechanisms needed by
+KV-cache-centric experiments:
+
+- decode workers can optionally accept local append-prefill requests in PD mode;
+- streaming session cache status is exposed for router/admission decisions;
+- idle streaming sessions can be evicted at session granularity;
+- direct append admission can check resident session state and D token pressure
+before the replay path bypasses P.
+
 ## Current Findings
 
 The micro-benchmark can make KV-cache-centric routing look better than
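The admission and eviction bullets in the hunk above can be sketched roughly as follows. The diff does not include the patched worker code, so every name here (`SessionStatus`, `admit_direct_append`, `pick_idle_sessions`, the `max_pressure` threshold) is a hypothetical illustration rather than the SGLang patch itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionStatus:
    # Hypothetical fields; the status exposed by the patched worker may differ.
    session_id: str
    resident: bool      # is the session's KV cache still held on the decode worker?
    cached_tokens: int  # tokens currently cached for this session
    last_used: float    # monotonic timestamp of the session's last request


def admit_direct_append(status: Optional[SessionStatus], append_tokens: int,
                        used_tokens: int, capacity_tokens: int,
                        max_pressure: float = 0.9) -> bool:
    """Decide whether an append request may bypass P and prefill locally on D."""
    if status is None or not status.resident:
        return False  # no resident KV cache: use the regular P -> D path
    # Token pressure on D after admitting the append, as a fraction of capacity.
    projected = (used_tokens + append_tokens) / capacity_tokens
    return projected <= max_pressure


def pick_idle_sessions(statuses: list[SessionStatus], now: float,
                       idle_seconds: float) -> list[str]:
    """Return ids of resident sessions idle for at least idle_seconds.

    Eviction is per session: each returned id stands for the whole session cache.
    """
    return [s.session_id for s in statuses
            if s.resident and now - s.last_used >= idle_seconds]
```

In this sketch a request that fails admission simply falls back to the prefill worker, so a wrong guess costs latency rather than correctness.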