diff --git a/analysis/agentic_pd_unified_story_plan.md b/analysis/agentic_pd_unified_story_plan.md new file mode 100644 index 0000000..68c1328 --- /dev/null +++ b/analysis/agentic_pd_unified_story_plan.md @@ -0,0 +1,552 @@ +# Agentic PD / Unified Routing Story Plan + +Status: draft for review +Date: 2026-05-25 + +## 0. Goal + +This document aligns three threads: + +1. `agentic-kv`: vLLM-based PD-colocation, full PD separation, LMetric, + Unified routing, and elastic migration experiments. +2. `dash0:/home/admin/cpfs/wjh/agentic-kv`: run artifacts and the + latest PD-separation paper-section scaffold. +3. `~/phd/agentic-pd-hybrid`: SGLang/PPD/KVC experiments, including + retractions and stricter framing around loadgen validity. + +The purpose is to converge on a defensible story and a concrete task plan, +not to force the old Unified routing hypothesis to be true. + +## 1. Current Best Framing + +### 1.1 Workload premise + +Agentic serving is not chatbot serving. + +- Requests have long input contexts and short outputs. +- Most reusable KV is intra-session, not cross-session. +- Sessions are multi-turn and causally sequential: turn N+1 cannot be + faithfully issued before turn N finishes. +- Long-lived sessions create two competing needs: + - keep cache locality for future turns; + - avoid pinning all future work to an overloaded worker. + +This workload makes cache locality a first-order system objective, but +also makes naive session pinning dangerous. + +### 1.2 System premise + +PD separation is not a universally good abstraction. It helps only when: + +``` +saved decode interference > KV transfer + P queue + D queue + KV capacity cost +``` + +For agentic workloads, that inequality often fails because long-context KV +is large and decode-side KV residency becomes the limiting resource. + +### 1.3 Main thesis candidate + +Static PD separation is the wrong default for single-node agentic serving. +The stronger baseline is PD-colocation with cache-aware routing. 
The +interesting open problem is not "separate prefill and decode everywhere", +but: + +> how to preserve session-level KV locality while retaining enough routing +> freedom to avoid hot-worker queueing and decode interference. + +Unified routing should be framed as an attempt at that problem. The current +experiments show that the migration actuator was too expensive, so the +story should distinguish the principle from the failed mechanism. + +## 2. What We Should Align Across Repos + +### 2.1 Naming / architecture mapping + +Use one taxonomy consistently: + +| Name in paper/story | vLLM repo term | SGLang/KVC repo term | Meaning | +|---|---|---|---| +| Replica / PD-colo | combined / PD-colocated | `pd_colo`, SGLang `cache_aware` | all workers do prefill + decode | +| x=0 PD-disagg | full PD separation | `pd_disagg` | every turn goes P then D | +| x=1 / append-prefill-on-D | not implemented as such in vLLM experiments | KVC / PPD-style direct-to-D | turn 1 seeds D; later turns prefill locally on D | +| Elastic migration | Unified PUSH / elastic offload | smart migration / re-pin sessions | move a session or a request away from overloaded worker | +| Hybrid routing | current Unified baseline | PD-colo + soft pin / kv-aware | cache-aware LB plus explicit affinity only when worth it | + +Important distinction: vLLM Unified PUSH migration is not the same as PPD +x=1. Unified PUSH still pays cross-instance KV movement for migrated +requests. PPD x=1 tries to avoid P-to-D transfer on later turns by doing +append-prefill directly on the resident D node. + +### 2.2 Results that look stable + +These are safe to build around: + +1. Full/static PD separation is weak for agentic on one node. + - vLLM evidence: decode-side KV memory wall and transfer overhead. + - SGLang evidence: x=0 PD-disagg is consistently worse than PD-colo. + +2. LMetric/cache-aware routing is a strong baseline. 
+ - In vLLM, corrected LMetric nearly matches session-sticky because + `new_tokens = input - cached` creates implicit soft affinity. + - In SGLang, `cache_aware` is the production-stable baseline and often + wins or ties. + +3. Explicit session pinning is not automatically good. + - It can improve cache hit rate. + - It can create head-of-line blocking if sessions grow unevenly. + - Initial placement matters; mid-session migration is desirable in + principle but hard in practice. + +4. Transfer-based migration is currently too expensive. + - vLLM experiments: forced/relaxed migration worsened E2E tail. + - Mooncake path lacks enough overlap/layerwise benefit in current setup. + +5. Loadgen validity must be treated as substrate, not detail. + - `agentic-pd-hybrid` explicitly retracted high-concurrency claims + where session sequentiality was violated. + - Future experiments must enforce per-session causality. + +### 2.3 Results that need more careful wording + +1. "Unified beats LMetric" should not be stated as a strong result yet. + The latest stable implementation is closer to LMetric plus a high-cache + affinity gate. Expected gain is small by design. + +2. "PD separation is always bad" is too broad. + The correct claim is conditional: static/full PD separation is net + negative in the long-context, high-KV-footprint, single-node agentic + regime we measured. + +3. "KVC/PPD wins" is not established for our stack. + The SGLang repo contains useful PPD framing, but also several retractions: + high-concurrency wins were affected by loadgen issues, and KVC stability + was not production-ready in some runs. + +4. "Session migration will fix load balance" is still a hypothesis. + It is valid as a first-principles goal, but the tested vLLM actuator + did not satisfy the cost budget. + +## 3. 
Proposed Storytelling Outline + +### Section A: Why agentic serving is different + +Claim: + +- Agentic workloads combine long contexts, high intra-session reuse, and + sequential multi-turn sessions. +- This makes KV cache lifecycle and routing more important than the classic + prefill/decode kernel dichotomy. + +Evidence to use: + +- Input/output token CDF. +- KV reuse decomposition: intra-session vs cross-session. +- Session length / context growth examples. +- Per-session sequentiality requirement. + +Needed cleanup: + +- Use one trace definition and report sampling method. +- Explicitly state whether a trace is online-realistic, synthetic burst, + or stress test. + +### Section B: Why static PD separation fails + +Claim: + +- The classic roofline premise is true but insufficient. +- Prefill can be compute-bound while static PD separation still loses at + the system level. + +Mechanism: + +- PD separation relocates prefill; it does not reduce total prefill work. +- It adds KV transfer. +- It concentrates decode KV residency onto fewer D GPUs. +- Long-context agentic requests hit a decode-side KV memory wall. + +Evidence to use: + +- `analysis/pd_sep_paper_section/system_analysis.md` +- C1 workload figures. +- C6 roofline figure. +- KV memory wall model. +- Fresh PD matrix once rerun without forced eager mode. + +Task implication: + +- Complete C2/C3/C4/C5 matrix before making this a paper-grade section. + +### Section C: Why cache-aware PD-colo is hard to beat + +Claim: + +- Cache-aware routing already captures much of the desired session affinity. +- LMetric's cache-adjusted prefill cost gives implicit soft affinity without + hard pinning. + +Mechanism: + +- A worker with cached prefix has lower `new_tokens`. +- This naturally attracts later turns unless the worker is sufficiently + loaded. +- This is exactly the balance we want: preserve locality while retaining + routing freedom. + +Evidence to use: + +- Corrected LMetric vs Linear comparison. 
+- APC distribution. +- PD-colo stability from SGLang/KVC repo. + +Task implication: + +- Treat LMetric/cache-aware PD-colo as the primary baseline, not round-robin + or naive sticky. + +### Section D: Why Unified migration did not improve over LMetric + +Claim: + +- Unified's principle was right, but the migration mechanism failed the + cost budget. + +Mechanism: + +- At conservative gates, too few requests migrate to change load balance. +- At relaxed gates, migration overhead dominates. +- Cold/heavy requests often cannot benefit from source cache and remain + colocated. +- Cached migration still pays P-side queue, KV movement, and D admission. +- The cost model initially underestimated cache-attraction feedback and + queue effects. + +Evidence to use: + +- Git history: single argmin -> soft affinity -> decode load/hard gate -> + forced migration -> revert -> hybrid LMetric. +- Approach B / relaxed gate regressions. +- 16-session contention: interference exists, but elastic RDMA made TPOT + worse and offloaded too few requests. + +Task implication: + +- Do not revive three-way argmin or aggressive PUSH migration. +- Frame current Unified as hybrid LMetric plus selective affinity. + +### Section E: What remains promising + +There are two different future paths. They should not be conflated. + +Path 1: Conservative, vLLM-ready. + +- Stay PD-colocated. +- Use corrected LMetric as base. +- Add only explicit high-cache affinity / tie-break logic where it improves + stability. +- Improve scheduling: adaptive chunked prefill, decode-priority controls, + better observability of queue and cache state. + +Path 2: Research, PPD-style. + +- Turn 1 seeds session on D. +- Later turns do append-prefill on resident D, avoiding P-to-D transfer. +- Dynamic x chooses P vs D based on append size, P queue, D load, and SLO. +- Requires stable implementation and strict loadgen validation. 
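The dynamic x choice in Path 2 can be sketched as a small decision function. This is only an illustration under a made-up cost model: `TurnState`, `choose_prefill_site`, the throughput constants, and the load cap are hypothetical names and values, not taken from `agentic-kv` or `agentic-pd-hybrid`. It prefers the resident D worker whenever that keeps the turn within its TTFT budget, and otherwise compares the estimated cost of append-prefill on D against prefill on P plus KV transfer.

```python
from dataclasses import dataclass


@dataclass
class TurnState:
    """Per-turn inputs to the routing decision (all fields are assumptions)."""
    append_tokens: int    # uncached tokens this turn must prefill
    p_queue_s: float      # estimated queueing delay on the P worker, seconds
    d_load: float         # resident D worker load, 0.0 (idle) to 1.0 (full)
    kv_transfer_s: float  # estimated P-to-D KV transfer time, seconds
    ttft_slo_s: float     # TTFT budget for this turn, seconds


def choose_prefill_site(t, d_tok_per_s=4000.0, p_tok_per_s=8000.0, d_load_cap=0.9):
    """Return "D" for append-prefill on the resident D worker, else "P"."""
    # An overloaded D worker should not take more prefill work.
    if t.d_load >= d_load_cap:
        return "P"
    # Hypothetical model: prefill on D slows down as decode load rises.
    d_cost = t.append_tokens / (d_tok_per_s * max(1.0 - t.d_load, 0.05))
    # Prefill on P pays queueing, compute, and the KV transfer back to D.
    p_cost = t.p_queue_s + t.append_tokens / p_tok_per_s + t.kv_transfer_s
    # Prefer the worker that already owns the session KV if the SLO holds.
    if d_cost <= t.ttft_slo_s:
        return "D"
    # Otherwise fall back to whichever estimate is cheaper.
    return "D" if d_cost <= p_cost else "P"


small_turn = TurnState(append_tokens=500, p_queue_s=0.2, d_load=0.2,
                       kv_transfer_s=0.5, ttft_slo_s=1.0)
heavy_turn = TurnState(append_tokens=60000, p_queue_s=0.2, d_load=0.5,
                       kv_transfer_s=0.5, ttft_slo_s=2.0)
print(choose_prefill_site(small_turn), choose_prefill_site(heavy_turn))
```

The constants would need to be calibrated from measured prefill throughput and transfer time before such a rule could be compared against PD-colo cache-aware routing.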
+ +The paper/story can say: transfer-based migration did not work; append- +prefill-on-resident-D remains a different and potentially better actuator. + +## 4. Design Direction Recommendation + +### 4.1 Near-term path + +Use PD-colo cache-aware as the production baseline and paper baseline. + +Implement/validate only low-risk routing improvements: + +1. Pure LMetric baseline must stay separate and reproducible. +2. Unified hybrid should be LMetric plus: + - high-cache explicit affinity; + - overload escape; + - deterministic non-degenerate tie-break; + - route-decision logging. +3. No Mooncake/PUSH migration on the critical comparison path. + +This gives a clean statement: + +> The best robust single-node policy we have is cache-aware PD-colocation. +> Unified hybrid is a small refinement, not a new disaggregation win. + +### 4.2 Research path + +If we want a stronger contribution beyond "PD-sep loses", the promising +research direction is: + +> session-resident append-prefill with dynamic P/D selection. + +This aligns better with PPD than vLLM PUSH migration does. + +Key design principle: + +- Do not move KV just to run prefill elsewhere unless the future benefit is + large enough to amortize the transfer. +- Prefer using the worker that already owns the session KV, unless decode + load or append size makes that choice violate SLO. + +## 5. Experiment Plan + +### 5.1 Must-have validity checks + +For every benchmark: + +- Per-session sequentiality enforced. +- Attempted/completed/error counts reported. +- Pair by `(session_id, turn_id)` when comparing arms. +- Report goodput, not only latency of successes. +- Record git commit, launch flags, trace path, request limit, time scale, + session sampling method, and hardware. + +### 5.2 PD separation matrix + +Goal: make the static PD-sep negative result paper-grade. + +Arms: + +- PD-colo cache-aware. +- PD-sep 4P+4D. +- PD-sep 6P+2D. +- Optional: round-robin baseline only as sanity, not main comparison. 
+- Optional: eager vs cudagraph ablation. + +Metrics: + +- TTFT/E2E/TPOT p50/p90/p99. +- Goodput and error rate. +- APC mean and per-instance distribution. +- GPU util and decode-side KV occupancy time series. +- TTFT breakdown: prefill, KV transfer, D wait. + +Output: + +- C2 headline bar with error bars. +- C3 KV utilization time series. +- C4 TTFT stacked breakdown. +- C5 cuda-graph ablation. + +### 5.3 LMetric vs Unified hybrid + +Goal: determine whether current Unified has any real gain over LMetric. + +Arms: + +- Pure corrected LMetric. +- Current Unified hybrid. + +Run: + +- 3-5 paired trials on the same trace. +- No Mooncake/PUSH. +- Same launch flags. + +Additional logging: + +- Route reason: `lmetric`, `high_cache_affinity`, `overload_escape`, + `tie_break`. +- Chosen instance load, cache hit, effective new tokens. + +Decision rule: + +- If gain is within noise, do not oversell Unified as a performance win. + Keep it as a policy cleanup / safety improvement. + +### 5.4 Interference and scheduler experiments + +Goal: test whether scheduling is the right actuator after routing saturates. + +Arms: + +- Different chunked prefill sizes. +- Decode-priority / prefill throttling if available. +- High-concurrency but session-sequential trace. + +Metrics: + +- TPOT under concurrent heavy prefills. +- TTFT for heavy turns. +- Decode queue delay. +- GPU util timeline. + +Expected value: + +- If migration is too expensive, reducing prefill interference in-place is + the most plausible next improvement. + +### 5.5 PPD/KVC-style research validation + +Goal: separate PPD x=1 from failed x=0/full PD and failed transfer-based +migration. + +Arms: + +- PD-colo cache-aware. +- x=0 PD-disagg. +- x=1 append-prefill-on-D if implementation is stable. +- Dynamic x if available. + +Guardrails: + +- Do not use old high-concurrency KVC numbers without the loadgen caveat. +- Do not compare partial successful subsets without goodput. 
+- Treat SGLang implementation bugs as system results, not hidden noise. + +## 6. Task Breakdown + +### Track 1: Documentation alignment + +Owner task: + +- Update `REPORT.md`, `docs/migration-policy-design.md`, and + `analysis/research_findings.md` so they use the taxonomy in section 2. + +Concrete edits: + +- Mark single-argmin/PUSH Unified as historical. +- State that current Unified is hybrid LMetric plus high-cache affinity. +- Add mapping to PPD taxonomy: Replica, x=0 PD, x=1 append-prefill. +- Add loadgen validity checklist. + +Done when: + +- A reviewer can no longer confuse vLLM PUSH migration with PPD x=1. +- LMetric baseline and Unified hybrid are described as separate policies. + +### Track 2: Current routing cleanup + +Owner task: + +- Make current Unified hybrid auditable and minimal. + +Concrete edits: + +- Remove stale unreachable PUSH code from `scripts/cache_aware_proxy.py`. +- Keep pure `--policy lmetric` untouched. +- Add route-decision fields for Unified hybrid. +- Add tests: + - pure LMetric remains pure; + - high-cache affinity triggers only under its intended gate; + - overload escape works; + - empty-batch tie-break does not collapse to instance 0. + +Done when: + +- `pytest tests/test_proxy_pick.py` covers LMetric and Unified separately. +- Bench logs can count how often Unified did something beyond LMetric. + +### Track 3: PD-sep paper matrix + +Owner task: + +- Finish the `analysis/pd_sep_paper_section` missing claims. + +Concrete work: + +- Run `bench_pd_matrix.sh` on dash0. +- Collect `metrics.summary.json`, `breakdown.json`, `apc.txt`, + `gpu_util.csv`, and per-instance KV logs. +- Add plotters for C2/C3/C4/C5. +- Replace legacy C7 numbers with matrix outputs. + +Done when: + +- The PD-sep negative result no longer relies on old `--enforce-eager` + methodology or single snapshots. 
+ +### Track 4: Benchmark substrate validation + +Owner task: + +- Audit the vLLM replayer and any dash0 loadgen scripts for session + sequentiality and arrival semantics. + +Concrete checks: + +- Verify no session has more than one in-flight turn unless explicitly + configured as a stress test. +- Add an analyzer that reports max concurrent turns per session. +- Report sampled session-start distribution. +- Add goodput and error-rate comparisons to all summary scripts. + +Done when: + +- We can label each experiment as online-realistic, burst stress, or + synthetic microbench. + +### Track 5: Scheduler/interference path + +Owner task: + +- Test whether in-place scheduling beats transfer-based migration. + +Concrete experiments: + +- Chunk size sweep. +- Decode-priority or prefill-throttle sweep. +- 16+ session sequential replay. + +Done when: + +- We know whether the next performance lever is scheduler policy or routing + policy. + +### Track 6: PPD-style appendix / related design + +Owner task: + +- Extract the useful `agentic-pd-hybrid` lessons without importing invalid + claims. + +Concrete work: + +- Summarize: + - loadgen bug and retractions; + - PD-colo as stable baseline; + - x=0 PD-disagg failure; + - x=1/append-prefill motivation; + - dynamic threshold lessons. +- Decide whether this is mainline future work or an appendix framing. + +Done when: + +- The story can cite PPD-style append-prefill as a distinct future actuator, + not as evidence that the current Unified migration already works. + +## 7. Proposed One-Sentence Story + +Agentic serving breaks the classic PD-disaggregation intuition: long-lived +sessions make KV locality dominant, while long contexts make decode-side KV +capacity and transfer costs dominate the gains from isolating prefill; the +robust design is cache-aware PD-colocation with carefully limited session +affinity, and future disaggregation must be dynamic and session-resident +rather than static or transfer-heavy. + +## 8. 
Open Decisions For Review + +1. Do we want the main paper contribution to be the negative result + "static PD separation fails for agentic workloads", or the positive system + "cache-aware PD-colo / Unified hybrid"? + +2. Is PPD-style x=1 append-prefill a future-work section, or do we need to + implement a minimal stable version before finalizing the story? + +3. Should current Unified be presented as a named system if its measured + improvement over LMetric is small, or should it be framed as an audit of + why LMetric/cache-aware routing is already strong? + +4. Which trace is the canonical trace for claims: the vLLM trace in + `agentic-kv`, the GLM-5.1 trace in `agentic-pd-hybrid`, or both with + explicit regime labels? + +5. What is the target venue-style claim: a systems negative result, + a workload characterization, or a routing/scheduling algorithm?