docs: document sglang maintenance workflow

2026-04-24 12:31:32 +00:00
parent b8e6f13c20
commit 5bdc0ed4f0
2 changed files with 30 additions and 8 deletions
--- a/docs/PROJECT_OVERVIEW.md
+++ b/docs/PROJECT_OVERVIEW.md
@@ -5,9 +5,11 @@ session-aware and KV-cache-aware prefill/decode routing can improve end-to-end
 latency for agentic coding workloads on top of SGLang xPyD.

 The current target environment is a single 8-GPU node running SGLang `v0.5.10`
-with Qwen3-Coder-30B-A3B-Instruct. The local setup keeps the P -> D transfer
-path through SGLang disaggregation and Mooncake loopback instead of replacing it
-with an in-process shortcut.
+with Qwen3-Coder-30B-A3B-Instruct. The repo vendors SGLang under
+`third_party/sglang` so our xPyD/session-cache changes are maintained together
+with the benchmark harness. The local setup keeps the P -> D transfer path
+through SGLang disaggregation and Mooncake loopback instead of replacing it with
+an in-process shortcut.

 ## Design

@@ -57,6 +59,24 @@ The prototype currently includes:
  disaggregation wait timeout to avoid treating transfer hangs as successful
  long-tail responses.

+## SGLang Maintenance
+
+SGLang is tracked directly in this repository:
+
+- `chore: vendor sglang v0.5.10 snapshot` records the clean upstream baseline.
+- Later `feat(sglang): ...` / `fix(sglang): ...` commits should contain only
+  local SGLang changes.
+- Generated files such as `__pycache__` and benchmark outputs stay ignored.
+
+The current SGLang patch adds the worker-side mechanisms needed by
+KV-cache-centric experiments:
+
+- decode workers can optionally accept local append-prefill requests in PD mode;
+- streaming session cache status is exposed for router/admission decisions;
+- idle streaming sessions can be evicted at session granularity;
+- direct append admission can check resident session state and D token pressure
+  before the replay path bypasses P.
+
 ## Current Findings

 The micro-benchmark can make KV-cache-centric routing look better than