# Project Overview

This repository is a minimal research prototype for evaluating whether
session-aware and KV-cache-aware prefill/decode routing can improve end-to-end
latency for agentic coding workloads on top of SGLang xPyD.

The current target environment is a single 8-GPU node running SGLang `v0.5.10`
with Qwen3-Coder-30B-A3B-Instruct. The local setup keeps the P -> D transfer
path through SGLang disaggregation and Mooncake loopback instead of replacing it
with an in-process shortcut.

## Design

The code keeps policy separate from mechanism.

- Mechanism code launches SGLang workers, sends requests, manages streaming
  sessions, and records request-level metrics.
- Policy code decides which prefill worker and decode worker should receive a
  request.
- Replay and benchmark code preserve trace arrival times unless explicitly
  configured otherwise, so concurrency comes from the workload shape rather than
  from an artificial fixed-concurrency driver.

The main comparison points are:

- `pd-disaggregation`: normal router-managed P/D serving.
- `kvcache-centric`: worker/router-assisted session-aware routing that can keep
  a decode streaming session resident and send later small appends directly to D.
- `pd-colo`: direct colocated serving baseline for experiments that do not use
  the P/D router path.

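The policy/mechanism split described in this section can be sketched as a small interface. This is an illustrative sketch only, not the repository's actual code: `Request`, `Placement`, `RoutingPolicy`, `RoundRobinPolicy`, `StickyPolicy`, and `dispatch` are hypothetical names, and round-robin is an assumed stand-in for whatever "simple baseline placement" means in the real `default` policy.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class Request:
    session_id: str
    turn: int


@dataclass(frozen=True)
class Placement:
    prefill: str
    decode: str


class RoutingPolicy(Protocol):
    """Policy side: only decides where a request goes."""

    def choose(self, request: Request, prefill: list[str], decode: list[str]) -> Placement: ...


@dataclass
class RoundRobinPolicy:
    """Assumed baseline: cycle through workers and ignore session state."""

    _next: int = 0

    def choose(self, request, prefill, decode):
        placement = Placement(
            prefill[self._next % len(prefill)],
            decode[self._next % len(decode)],
        )
        self._next += 1
        return placement


@dataclass
class StickyPolicy:
    """Later turns reuse the decode worker that last served the same session."""

    fallback: RoutingPolicy = field(default_factory=RoundRobinPolicy)
    _last_decode: dict[str, str] = field(default_factory=dict)

    def choose(self, request, prefill, decode):
        placement = self.fallback.choose(request, prefill, decode)
        previous = self._last_decode.get(request.session_id)
        if previous in decode:
            # Keep the prefill choice, but pin decode to the previous worker.
            placement = Placement(placement.prefill, previous)
        self._last_decode[request.session_id] = placement.decode
        return placement


def dispatch(policy: RoutingPolicy, request: Request, prefill, decode) -> Placement:
    """Mechanism side: asks the policy for a placement; the actual send is omitted."""
    return policy.choose(request, prefill, decode)
```

Because the mechanism only calls `choose`, swapping `RoundRobinPolicy` for `StickyPolicy` in this sketch changes placement without touching launch, transport, or metrics code.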
## Implemented

The prototype currently includes:

- One-node P/D launch planning and managed stack lifecycle.
- A lightweight Python PD router used for live local experiments.
- Ali trace loading, session-granularity sampling, and synthetic prompt
  generation from `hash_ids`.
- Trace replay with natural pacing, request dependencies inside a session, and
  request-level metrics JSONL plus summary JSON.
- Routing policies:
  - `default`: simple baseline placement.
  - `sticky`: turns 2 and later prefer the previous D node for the same session.
  - `kv-aware`: uses observed block overlap/session state to choose D placement.
- Live benchmark orchestration through `benchmark-live`.
- Small-append synthetic trace generation for micro-benchmarks.
- KV-cache-centric worker admission modes:
  - router shadow-state admission.
  - worker-queried admission.
  - session-level D residency soft cap for worker-managed admission, so only a
    small hot set is kept as decode streaming sessions while the rest fall back
    to normal PD routing.
- P-side prefill backup bookkeeping for experiments where D evictions can retain
  a lower-priority copy on P.
- Fail-fast handling for empty streaming responses and a shorter SGLang
  disaggregation wait timeout to avoid treating transfer hangs as successful
  long-tail responses.
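The "natural pacing" and "request dependencies inside a session" items above can be illustrated with a minimal `asyncio` sketch. It is an assumption about how such a replay loop could look, not the repository's implementation: `TraceRequest`, `replay`, the `send` callback, and the meaning of `time_scale` (a multiplier on trace offsets) are all hypothetical. Each request sleeps until its scaled trace offset and additionally waits for the previous turn of its own session, so concurrency follows the trace rather than a fixed worker pool.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceRequest:
    session_id: str
    turn: int
    arrival_s: float  # arrival timestamp taken from the trace, in seconds


async def replay(requests, send, time_scale=1.0):
    """Replay requests at trace pacing; a turn also waits for the prior turn in its session."""
    loop = asyncio.get_running_loop()
    start = loop.time()
    t0 = min(r.arrival_s for r in requests)
    done = {(r.session_id, r.turn): asyncio.Event() for r in requests}

    async def run_one(req):
        # Natural pacing: wait for the scaled trace offset instead of a free slot.
        delay = (req.arrival_s - t0) * time_scale - (loop.time() - start)
        if delay > 0:
            await asyncio.sleep(delay)
        # Session dependency: turn N is not sent before turn N-1 has finished.
        previous = done.get((req.session_id, req.turn - 1))
        if previous is not None:
            await previous.wait()
        try:
            await send(req)
        finally:
            # Always unblock the next turn, even if this send raised.
            done[(req.session_id, req.turn)].set()

    await asyncio.gather(*(run_one(r) for r in requests))
```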
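The session-level D residency soft cap listed above can be pictured as a small admission gate. The sketch below is one plausible reading under stated assumptions, not the prototype's code: `ResidencySoftCap`, `max_resident_sessions`, and the three route labels are hypothetical. Sessions inside the hot set go directly to D, new sessions are admitted while there is room, and everything else falls back to normal PD routing instead of evicting a resident session.

```python
from dataclasses import dataclass, field


@dataclass
class ResidencySoftCap:
    max_resident_sessions: int  # hypothetical knob for the hot-set size
    _resident: set[str] = field(default_factory=set)

    def route(self, session_id: str) -> str:
        """Return an illustrative route label for the next turn of a session."""
        if session_id in self._resident:
            return "direct-to-d"
        if len(self._resident) < self.max_resident_sessions:
            self._resident.add(session_id)
            return "seed-on-d"
        # Over the cap: do not evict, just use the normal P -> D path.
        return "pd-fallback"

    def release(self, session_id: str) -> None:
        """Free a residency slot, for example when a session ends."""
        self._resident.discard(session_id)
```

With a cap of 1, a first session is seeded on D, a second session is routed through the PD fallback, and the second session can only be seeded after the first is released.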
## Current Findings

The micro-benchmark can make KV-cache-centric routing look better than
`pd-disaggregation` because the active sessions fit in the D KV cache. Later turns
can then bypass P and use `kvcache-direct-to-d-session`, reducing TTFT.

On the larger 316-request, variable-turn workload, there are 58 sessions and the
working set is larger than the useful D residency budget. A naive worker-managed
KV-cache-centric policy repeatedly evicts and reseeds whole sessions, adding
TTFT and transfer pressure. Aggressive P-backup also increases tail risk when it
keeps too much state around.

The current soft-cap optimization improves worker-managed KV-cache-centric
routing relative to the older worker-managed path, but `pd-disaggregation` is still
slightly better on the sampled Ali workload because most requests fall back to
normal PD routing while a few retained D sessions still consume token budget.

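The evict-and-reseed behaviour described for the larger workload can be reproduced with a toy model. This is an illustration under simplifying assumptions, not a measurement: the session count, turn count, and capacities are made up, residency is modelled as plain LRU over whole sessions, and `count_seeds_lru` is a hypothetical helper. When interleaved sessions outnumber the residency slots, every turn finds its session already evicted.

```python
from collections import OrderedDict


def count_seeds_lru(accesses, capacity):
    """Count turns that must seed or reseed a session under LRU residency."""
    resident = OrderedDict()
    seeds = 0
    for session in accesses:
        if session in resident:
            resident.move_to_end(session)  # hit: refresh recency
            continue
        seeds += 1  # miss: the whole session is transferred to D again
        resident[session] = True
        if len(resident) > capacity:
            resident.popitem(last=False)  # evict the least recently used session
    return seeds


sessions = [f"s{i}" for i in range(4)]
accesses = sessions * 3  # toy workload: 4 sessions, 3 interleaved turns each

fits = count_seeds_lru(accesses, capacity=4)     # only the 4 first-turn seeds
thrashes = count_seeds_lru(accesses, capacity=3)  # all 12 turns reseed
```

In this toy setup, dropping capacity from 4 to 3 moves the seed count from 4 to 12, which is the same qualitative pattern as the micro-benchmark versus the larger workload.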
## Useful Commands

Run a live benchmark with natural arrival timing:

```bash
uv run agentic-pd-hybrid benchmark-live \
  --trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
  --output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
  --mechanism kvcache-centric \
  --policy kv-aware \
  --kvcache-admission-mode worker \
  --prefill-workers 1 \
  --decode-workers 1 \
  --prefill-gpu-ids 0 \
  --decode-gpu-ids 1 \
  --transfer-backend mooncake \
  --target-duration-s 2000 \
  --session-sample-rate 1.0 \
  --min-turns 2 \
  --time-scale 1 \
  --concurrency-limit 1000
```
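After a run, the request-level metrics JSONL can be summarized with a few lines of Python. The sketch below is an assumption about the file contents: the field name `ttft_s` and the helper `summarize_ttft` are hypothetical, so adjust the key to whatever the metrics records actually contain. It reads one JSON object per line and reports count, mean, median, and 99th-percentile TTFT.

```python
import json
import statistics
from pathlib import Path


def summarize_ttft(metrics_path, key="ttft_s"):
    """Summarize a per-request latency field from a metrics JSONL file."""
    values = []
    for line in Path(metrics_path).read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get(key) is not None:
            values.append(float(record[key]))
    if len(values) < 2:
        raise ValueError(f"need at least two records with {key!r}")
    # 99 cut points: index 49 is the median, index 98 the 99th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "p50": cuts[49],
        "p99": cuts[98],
    }
```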
Generate a 30k input, 1k append, 256 output small-append trace:

```bash
uv run agentic-pd-hybrid make-small-append-trace \
  --output outputs/smoke-hotcap-30k-1k-256.jsonl \
  --session-count 4 \
  --turns-per-session 3 \
  --initial-input-length 30000 \
  --append-input-length 1000 \
  --output-length 256
```
## Known Limits

- This is not production routing code.
- The current evaluation is single-node and constrained by
  `prefill + decode <= 8` GPUs.
- Trace prompts are synthetic because the Ali trace used here contains lengths
  and `hash_ids`, not raw prompts.
- KV-cache-centric admission still needs better hot-session prediction. The next
  useful step is inter-turn-gap-aware admission and aging, so D cache is held
  only for sessions likely to reuse it soon.
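As a starting point for the inter-turn-gap-aware admission mentioned in the last item, one possible shape is sketched below. Nothing here exists in the prototype yet: `GapAwareAdmission`, `max_gap_s`, `alpha`, and the choice of an exponentially weighted moving average as the gap predictor are all assumptions made for illustration. The idea is to hold D residency only for sessions whose recent turns arrive close together, and to age out sessions that have been idle for longer than the threshold.

```python
from dataclasses import dataclass, field


@dataclass
class GapAwareAdmission:
    max_gap_s: float    # hypothetical threshold: hold D cache only below this gap
    alpha: float = 0.5  # EWMA weight given to the newest observed gap
    _last_seen: dict[str, float] = field(default_factory=dict)
    _gap_ewma: dict[str, float] = field(default_factory=dict)

    def observe(self, session_id: str, now_s: float) -> None:
        """Record a turn arrival and update the session's smoothed inter-turn gap."""
        last = self._last_seen.get(session_id)
        if last is not None:
            gap = now_s - last
            prev = self._gap_ewma.get(session_id, gap)
            self._gap_ewma[session_id] = self.alpha * gap + (1 - self.alpha) * prev
        self._last_seen[session_id] = now_s

    def should_hold(self, session_id: str) -> bool:
        """Hold residency only when a gap has been observed and it is short enough."""
        gap = self._gap_ewma.get(session_id)
        return gap is not None and gap <= self.max_gap_s

    def expired(self, now_s: float) -> list[str]:
        """Aging: sessions idle for longer than the threshold are eviction candidates."""
        return [s for s, t in self._last_seen.items() if now_s - t > self.max_gap_s]
```

In this sketch a session seen only once is never held, which is a deliberately conservative choice; a real policy might instead seed the predictor from trace-level statistics.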