Project Overview

This repository is a minimal research prototype for evaluating whether session-aware and KV-cache-aware prefill/decode routing can improve end-to-end latency for agentic coding workloads on top of SGLang xPyD.

The current target environment is a single 8-GPU node running SGLang v0.5.10 with Qwen3-Coder-30B-A3B-Instruct. The repo vendors SGLang under third_party/sglang so that our xPyD/session-cache changes are maintained together with the benchmark harness. The local setup keeps the prefill-to-decode (P -> D) transfer path through SGLang disaggregation and Mooncake loopback instead of replacing it with an in-process shortcut.

Design

The code keeps policy separate from mechanism.

  • Mechanism code launches SGLang workers, sends requests, manages streaming sessions, and records request-level metrics.
  • Policy code decides which prefill worker and decode worker should receive a request.
  • Replay and benchmark code preserve trace arrival times unless explicitly configured otherwise, so concurrency comes from the workload shape rather than from an artificial fixed-concurrency driver.
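
The policy/mechanism split described above can be sketched as a small interface. This is a minimal illustration, not the repository's actual API: `Request`, `Placement`, `RoutingPolicy`, `FirstWorkerPolicy`, and `dispatch` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class Request:
    """Hypothetical request descriptor; field names are assumptions."""
    session_id: str
    turn: int


@dataclass(frozen=True)
class Placement:
    """Which prefill (P) and decode (D) worker should serve a request."""
    prefill_worker: int
    decode_worker: int


class RoutingPolicy(Protocol):
    """Policy side: decides placement and performs no I/O."""

    def place(
        self, req: Request, prefill: Sequence[int], decode: Sequence[int]
    ) -> Placement: ...


class FirstWorkerPolicy:
    """Trivial baseline: always pick the first P and the first D worker."""

    def place(
        self, req: Request, prefill: Sequence[int], decode: Sequence[int]
    ) -> Placement:
        return Placement(prefill[0], decode[0])


def dispatch(
    req: Request,
    policy: RoutingPolicy,
    prefill: Sequence[int],
    decode: Sequence[int],
) -> Placement:
    """Mechanism side: ask the policy where to send the request.

    A real mechanism would then send the request and record metrics;
    this sketch only returns the decision.
    """
    return policy.place(req, prefill, decode)
```

Because the mechanism depends only on the `place` signature, a different policy can be swapped in without touching launch, streaming, or metrics code.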

The main comparison points are:

  • pd-disaggregation: normal router-managed P/D serving.
  • kvcache-centric: worker- or router-assisted session-aware routing that can keep a streaming session resident on a decode worker and send later small appends directly to D.
  • pd-colo: direct colocated serving baseline for experiments that do not use the P/D router path.

Implemented

The prototype currently includes:

  • One-node P/D launch planning and managed stack lifecycle.
  • A lightweight Python PD router used for live local experiments.
  • Ali trace loading, session-granularity sampling, and synthetic prompt generation from hash_ids.
  • Trace replay with natural pacing, request dependencies inside a session, and request-level metrics JSONL plus summary JSON.
  • Routing policies:
    • default: simple baseline placement.
    • sticky: from turn 2 onward, prefer the D node that served the same session's previous turn.
    • kv-aware: uses observed block overlap/session state to choose D placement.
  • Live benchmark orchestration through benchmark-live.
  • Small-append synthetic trace generation for micro-benchmarks.
  • KV-cache-centric worker admission modes:
    • router shadow-state admission.
    • worker-queried admission.
    • session-level D residency soft cap for worker-managed admission, so only a small hot set is kept as decode streaming sessions while the rest fall back to normal PD routing.
  • P-side prefill backup bookkeeping for experiments where D evictions can retain a lower-priority copy on P.
  • Fail-fast handling for empty streaming responses and a shorter SGLang disaggregation wait timeout to avoid treating transfer hangs as successful long-tail responses.
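
As a rough illustration of the sticky and kv-aware policies listed above, the following sketch shows one way the D-side choice could be made. The function names, the `last_decode` map, and the `cached_blocks` map are assumptions made for this example; the actual policies may weigh additional session state.

```python
from typing import Dict, Sequence, Set


def pick_sticky(
    session_id: str,
    last_decode: Dict[str, int],
    decode_workers: Sequence[int],
) -> int:
    """Prefer the D worker that served this session's previous turn.

    Falls back to the first worker for turn 1 or if the previous
    worker is no longer available.
    """
    prev = last_decode.get(session_id)
    if prev is not None and prev in decode_workers:
        return prev
    return decode_workers[0]


def pick_kv_aware(
    request_blocks: Sequence[int],
    cached_blocks: Dict[int, Set[int]],
    decode_workers: Sequence[int],
) -> int:
    """Pick the D worker whose observed blocks overlap the request most.

    `request_blocks` stands in for the trace's hash_ids. Ties go to the
    worker that appears first in `decode_workers`.
    """
    wanted = set(request_blocks)
    return max(
        decode_workers,
        key=lambda w: len(wanted & cached_blocks.get(w, set())),
    )
```

For example, `pick_kv_aware([1, 2, 3], {1: {1}, 2: {2, 3}}, [1, 2])` selects worker 2, because it already holds two of the three requested blocks.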

SGLang Maintenance

SGLang is tracked directly in this repository:

  • chore: vendor sglang v0.5.10 snapshot records the clean upstream baseline.
  • Later feat(sglang): ... / fix(sglang): ... commits should contain only local SGLang changes.
  • Generated files such as __pycache__ and benchmark outputs stay ignored.

The current SGLang patch adds the worker-side mechanisms needed by KV-cache-centric experiments:

  • decode workers can optionally accept local append-prefill requests in PD mode;
  • streaming session cache status is exposed for router/admission decisions;
  • idle streaming sessions can be evicted at session granularity;
  • direct append admission can check resident session state and D token pressure before the replay path bypasses P.
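
The last check above can be pictured as a small predicate. This is a sketch under stated assumptions, not the patched SGLang code: `DecodeStatus`, `admit_direct_append`, and the `max_utilization` threshold are hypothetical, and the real admission path may use different signals.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class DecodeStatus:
    """Hypothetical snapshot of a D worker's streaming-session state."""
    resident_sessions: Set[str]
    used_tokens: int
    capacity_tokens: int


def admit_direct_append(
    session_id: str,
    append_tokens: int,
    output_tokens: int,
    status: DecodeStatus,
    max_utilization: float = 0.9,  # assumed pressure threshold
) -> bool:
    """Return True if an append may bypass P and go straight to D."""
    # The session's KV cache must already be resident on D.
    if session_id not in status.resident_sessions:
        return False
    # The append plus the expected output must fit under the token budget.
    projected = status.used_tokens + append_tokens + output_tokens
    return projected <= max_utilization * status.capacity_tokens
```

If the predicate returns False, the request would take the normal P -> D path instead.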

Current Findings

The micro-benchmark favors KV-cache-centric routing over pd-disaggregation because the active sessions fit in the D KV cache. Later turns can then bypass P via the kvcache-direct-to-d-session path, which reduces time to first token (TTFT).

On the larger 316-request, variable-turn workload, there are 58 sessions and the working set is larger than the useful D residency budget. A naive worker-managed KV-cache-centric policy repeatedly evicts and reseeds whole sessions, adding TTFT and transfer pressure. Aggressive P-backup also increases tail risk when it keeps too much state around.
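
A toy model helps show why an oversized working set causes repeated reseeding. The sketch below assumes LRU eviction at session granularity, a round-robin turn order, three turns per session, and a cap of 8 resident sessions; none of these are taken from the real workload or the prototype, and only the session count of 58 matches the text above.

```python
from collections import OrderedDict
from typing import Iterable


def count_seeds(accesses: Iterable[str], cap: int) -> int:
    """Count accesses that find their session non-resident on D.

    Each such access stands for a full seed or reseed of the session.
    Residency is modeled as an LRU set with `cap` slots.
    """
    resident: "OrderedDict[str, None]" = OrderedDict()
    seeds = 0
    for sid in accesses:
        if sid in resident:
            resident.move_to_end(sid)  # hit: mark as most recently used
            continue
        seeds += 1
        resident[sid] = None
        if len(resident) > cap:
            resident.popitem(last=False)  # evict least recently used
    return seeds


sessions = [f"s{i}" for i in range(58)]
round_robin = sessions * 3  # three interleaved turns per session (assumed)

print(count_seeds(round_robin, cap=8))   # 174: every turn reseeds
print(count_seeds(round_robin, cap=58))  # 58: only first turns seed
```

In this toy setting a cap below the working set yields no reuse at all, which is the thrashing pattern a naive policy runs into.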

The current soft-cap optimization improves worker-managed KV-cache-centric routing relative to the older worker-managed path, but pd-disaggregation is still slightly better on the sampled Ali workload: most requests fall back to normal PD routing, while the few retained D sessions still consume token budget.
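
The soft-cap idea can be sketched as follows. The function name and the three route labels are hypothetical, and the sketch omits eviction of idle sessions, which the real worker also performs.

```python
from typing import Set


def route_with_soft_cap(session_id: str, resident: Set[str], soft_cap: int) -> str:
    """Decide how to route one turn under a session-level residency cap.

    Returns one of three illustrative labels:
      "direct-to-d"    session is already resident on D
      "seed-d-session" room under the cap, so admit the session to D
      "pd-fallback"    cap reached, use normal P/D routing
    """
    if session_id in resident:
        return "direct-to-d"
    if len(resident) < soft_cap:
        resident.add(session_id)
        return "seed-d-session"
    return "pd-fallback"
```

With `soft_cap=2`, a third new session falls back to normal PD routing while the two resident sessions keep their direct path and keep holding D token budget.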

Useful Commands

Run a live benchmark with natural arrival timing:

uv run agentic-pd-hybrid benchmark-live \
  --trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
  --output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
  --mechanism kvcache-centric \
  --policy kv-aware \
  --kvcache-admission-mode worker \
  --prefill-workers 1 \
  --decode-workers 1 \
  --prefill-gpu-ids 0 \
  --decode-gpu-ids 1 \
  --transfer-backend mooncake \
  --target-duration-s 2000 \
  --session-sample-rate 1.0 \
  --min-turns 2 \
  --time-scale 1 \
  --concurrency-limit 1000
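
For intuition about natural arrival timing, the sketch below replays requests at their trace offsets and caps in-flight requests with a semaphore. It is not the harness's implementation: `replay` and `send` are hypothetical, session dependencies are omitted, and it assumes that a time scale multiplies arrival offsets, which may not match the semantics of `--time-scale`.

```python
import asyncio
import time
from typing import Awaitable, Callable, Sequence, Tuple


async def replay(
    arrivals: Sequence[Tuple[float, str]],
    send: Callable[[str], Awaitable[None]],
    time_scale: float = 1.0,
    concurrency_limit: int = 1000,
) -> None:
    """Fire each (offset_seconds, request_id) at its scaled arrival time."""
    gate = asyncio.Semaphore(concurrency_limit)
    start = time.monotonic()

    async def fire(offset_s: float, request_id: str) -> None:
        # Sleep until the scaled arrival time, measured from replay start.
        delay = offset_s * time_scale - (time.monotonic() - start)
        if delay > 0:
            await asyncio.sleep(delay)
        # The limit only bounds in-flight requests; it does not drive load.
        async with gate:
            await send(request_id)

    await asyncio.gather(*(fire(t, rid) for t, rid in arrivals))
```

With a high limit such as 1000, the semaphore rarely blocks, so concurrency follows the trace shape rather than a fixed driver.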

Generate a 30k input, 1k append, 256 output small-append trace:

uv run agentic-pd-hybrid make-small-append-trace \
  --output outputs/smoke-hotcap-30k-1k-256.jsonl \
  --session-count 4 \
  --turns-per-session 3 \
  --initial-input-length 30000 \
  --append-input-length 1000 \
  --output-length 256

Known Limits

  • This is not production routing code.
  • The current evaluation is single-node and constrained by prefill + decode <= 8 GPUs.
  • Trace prompts are synthetic because the Ali trace used here contains lengths and hash_ids, not raw prompts.
  • KV-cache-centric admission still needs better hot-session prediction. The next useful step is inter-turn-gap-aware admission and aging, so D cache is held only for sessions likely to reuse it soon.
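
One possible shape for inter-turn-gap-aware admission is sketched below. It is purely illustrative and not implemented in the prototype: `GapTracker`, the exponentially weighted moving average, and the `max_gap_s` threshold are all assumptions.

```python
from typing import Dict, Optional


class GapTracker:
    """Track a per-session moving average of inter-turn gaps."""

    def __init__(self, alpha: float = 0.5) -> None:
        self.alpha = alpha  # weight given to the newest gap
        self._last_seen: Dict[str, float] = {}
        self._ewma_gap: Dict[str, float] = {}

    def observe(self, session_id: str, now_s: float) -> None:
        """Record a turn arrival and update the session's gap estimate."""
        prev = self._last_seen.get(session_id)
        if prev is not None:
            gap = now_s - prev
            old = self._ewma_gap.get(session_id)
            self._ewma_gap[session_id] = (
                gap if old is None else self.alpha * gap + (1 - self.alpha) * old
            )
        self._last_seen[session_id] = now_s

    def predicted_gap(self, session_id: str) -> Optional[float]:
        """Return the estimated gap, or None before two turns are seen."""
        return self._ewma_gap.get(session_id)

    def should_hold(self, session_id: str, max_gap_s: float) -> bool:
        """Hold D residency only if the next turn is expected soon."""
        gap = self.predicted_gap(session_id)
        return gap is not None and gap <= max_gap_s
```

A session with turns at 0 s, 4 s, and 12 s gets a predicted gap of 6 s with `alpha=0.5`, so it would be held under a 10 s threshold and released under a 5 s one.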