# Project Overview

This repository is a minimal research prototype for evaluating whether session-aware and KV-cache-aware prefill/decode routing can improve end-to-end latency for agentic coding workloads on top of SGLang xPyD.

The current target environment is a single 8-GPU node running SGLang `v0.5.10` with Qwen3-Coder-30B-A3B-Instruct. The local setup keeps the P -> D transfer path through SGLang disaggregation and Mooncake loopback instead of replacing it with an in-process shortcut.

## Design

The code keeps policy separate from mechanism.

- Mechanism code launches SGLang workers, sends requests, manages streaming sessions, and records request-level metrics.
- Policy code decides which prefill worker and decode worker should receive a request.
- Replay and benchmark code preserve trace arrival times unless explicitly configured otherwise, so concurrency comes from the workload shape rather than from an artificial fixed-concurrency driver.

The main comparison points are:

- `pd-disaggregation`: normal router-managed P/D serving.
- `kvcache-centric`: worker/router assisted session-aware routing that can keep a decode streaming session resident and send later small appends directly to D.
- `pd-colo`: direct colocated serving baseline for experiments that do not use the P/D router path.

## Implemented

The prototype currently includes:

- One-node P/D launch planning and managed stack lifecycle.
- A lightweight Python PD router used for live local experiments.
- Ali trace loading, session-granularity sampling, and synthetic prompt generation from `hash_ids`.
- Trace replay with natural pacing, request dependencies inside a session, and request-level metrics JSONL plus summary JSON.
- Routing policies:
  - `default`: simple baseline placement.
  - `sticky`: turn2+ prefers the previous D node for the same session.
  - `kv-aware`: uses observed block overlap/session state to choose D placement.
- Live benchmark orchestration through `benchmark-live`.
- Small-append synthetic trace generation for micro-benchmarks.
- KV-cache-centric worker admission modes:
  - router shadow-state admission.
  - worker queried admission.
  - session-level D residency soft cap for worker-managed admission, so only a small hot set is kept as decode streaming sessions while the rest fall back to normal PD routing.
- P-side prefill backup bookkeeping for experiments where D evictions can retain a lower-priority copy on P.
- Fail-fast handling for empty streaming responses and a shorter SGLang disaggregation wait timeout to avoid treating transfer hangs as successful long-tail responses.

## Current Findings

The micro-benchmark can make KV-cache-centric routing look better than `pd-disaggregation` because the active sessions fit in D KV cache. Later turns can then bypass P and use `kvcache-direct-to-d-session`, reducing TTFT.

On the larger 316-request, variable-turn workload, there are 58 sessions and the working set is larger than the useful D residency budget. A naive worker-managed KV-cache-centric policy repeatedly evicts and reseeds whole sessions, adding TTFT and transfer pressure. Aggressive P-backup also increases tail risk when it keeps too much state around.

The current soft-cap optimization improves worker-managed KV-cache-centric relative to the older worker-managed path, but `pd-disaggregation` is still slightly better on the sampled Ali workload because most requests fall back to normal PD routing while a few retained D sessions still consume token budget.
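The soft-cap behavior described above can be illustrated with a small sketch. It is not the repository's implementation: the class `ResidencySoftCap`, its fields, and the labels `seed-d-session-via-pd` and `pd-fallback` are hypothetical names chosen for this example, and the cap is assumed to count sessions rather than tokens. The point it shows is that once the hot set is full, new sessions fall back to normal PD routing instead of evicting and reseeding a resident session.

```python
from dataclasses import dataclass, field

# "kvcache-direct-to-d-session" is the label used in this README.
# The other two labels are hypothetical and exist only in this sketch.
DIRECT_TO_D = "kvcache-direct-to-d-session"
SEED_VIA_PD = "seed-d-session-via-pd"
PD_FALLBACK = "pd-fallback"


@dataclass
class ResidencySoftCap:
    """Hypothetical session-count cap on D-resident streaming sessions."""

    max_resident_sessions: int
    resident: set = field(default_factory=set)

    def route(self, session_id: str) -> str:
        # A session that is already resident sends its append straight to D.
        if session_id in self.resident:
            return DIRECT_TO_D
        # Room in the hot set: seed a D streaming session through P -> D.
        if len(self.resident) < self.max_resident_sessions:
            self.resident.add(session_id)
            return SEED_VIA_PD
        # Hot set is full: do not evict anyone, use normal PD routing.
        return PD_FALLBACK

    def release(self, session_id: str) -> None:
        # Called when a session ends or is aged out; frees one slot.
        self.resident.discard(session_id)


cap = ResidencySoftCap(max_resident_sessions=2)
decisions = [cap.route(s) for s in ["s1", "s2", "s3", "s1"]]
print(decisions)
```

With a cap of two, `s1` and `s2` are seeded, `s3` falls back, and the second `s1` request goes directly to D. This also mirrors the finding above: when most sessions land in the fallback branch, the few resident sessions still hold D capacity without serving most of the traffic.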
## Useful Commands

Run a live benchmark with natural arrival timing:

```bash
uv run agentic-pd-hybrid benchmark-live \
  --trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
  --output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
  --mechanism kvcache-centric \
  --policy kv-aware \
  --kvcache-admission-mode worker \
  --prefill-workers 1 \
  --decode-workers 1 \
  --prefill-gpu-ids 0 \
  --decode-gpu-ids 1 \
  --transfer-backend mooncake \
  --target-duration-s 2000 \
  --session-sample-rate 1.0 \
  --min-turns 2 \
  --time-scale 1 \
  --concurrency-limit 1000
```

Generate a 30k input, 1k append, 256 output small-append trace:

```bash
uv run agentic-pd-hybrid make-small-append-trace \
  --output outputs/smoke-hotcap-30k-1k-256.jsonl \
  --session-count 4 \
  --turns-per-session 3 \
  --initial-input-length 30000 \
  --append-input-length 1000 \
  --output-length 256
```

## Known Limits

- This is not production routing code.
- The current evaluation is single-node and constrained by `prefill + decode <= 8` GPUs.
- Trace prompts are synthetic because the Ali trace used here contains lengths and `hash_ids`, not raw prompts.
- KV-cache-centric admission still needs better hot-session prediction. The next useful step is inter-turn-gap-aware admission and aging, so D cache is held only for sessions likely to reuse it soon.
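
As a rough illustration of the inter-turn-gap-aware admission and aging idea in the last item, the sketch below tracks a per-session moving average of the gap between turns. Nothing here exists in the repository: `GapTracker`, `should_hold`, `is_stale`, the smoothing factor, and the thresholds are all assumptions made for this example, and an exponentially weighted average is only one possible predictor.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GapTracker:
    """Hypothetical per-session estimate of the time between turns."""

    alpha: float = 0.5  # assumed smoothing factor for the moving average
    last_arrival_s: Optional[float] = None
    ewma_gap_s: Optional[float] = None

    def observe(self, arrival_s: float) -> None:
        # Update the moving average with the gap since the previous turn.
        if self.last_arrival_s is not None:
            gap = arrival_s - self.last_arrival_s
            if self.ewma_gap_s is None:
                self.ewma_gap_s = gap
            else:
                self.ewma_gap_s = self.alpha * gap + (1 - self.alpha) * self.ewma_gap_s
        self.last_arrival_s = arrival_s


def should_hold(tracker: GapTracker, max_gap_s: float) -> bool:
    # Admission: hold D cache only if the next turn is expected soon.
    return tracker.ewma_gap_s is not None and tracker.ewma_gap_s <= max_gap_s


def is_stale(tracker: GapTracker, now_s: float, slack: float = 2.0) -> bool:
    # Aging: release a session that has been idle well beyond its usual gap.
    if tracker.last_arrival_s is None or tracker.ewma_gap_s is None:
        return False
    return now_s - tracker.last_arrival_s > slack * tracker.ewma_gap_s


tracker = GapTracker()
for arrival_s in (0.0, 10.0, 20.0):
    tracker.observe(arrival_s)
print(should_hold(tracker, max_gap_s=30.0), is_stale(tracker, now_s=45.0))
```

A session with a single observed turn has no gap estimate yet, so this sketch declines to hold it; whether first turns should be admitted optimistically is a separate policy choice.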