Agentic PD / Unified routing story plan draft

User's 2026-05-25 draft aligning three threads (agentic-kv vLLM experiments, dash0 artifacts, agentic-pd-hybrid SGLang work) into a single story for the paper. Tracked so future iterations and review history are in version control. Co-Authored-By: Gahow Wang <chiahaco@gmail.com>
2026-05-26 01:12:42 +08:00
parent 0881942cf3
commit c63dc151a0
1 changed files with 552 additions and 0 deletions
--- a/analysis/agentic_pd_unified_story_plan.md
+++ b/analysis/agentic_pd_unified_story_plan.md
@@ -0,0 +1,552 @@
 # Agentic PD / Unified Routing Story Plan
 Status: draft for review
 Date: 2026-05-25
 ## 0. Goal
 This document aligns three threads:
 1. `agentic-kv`: vLLM-based PD-colocation, full PD separation, LMetric,
   Unified routing, and elastic migration experiments.
 2. `dash0:/home/admin/cpfs/wjh/agentic-kv`: run artifacts and the
   latest PD-separation paper-section scaffold.
 3. `~/phd/agentic-pd-hybrid`: SGLang/PPD/KVC experiments, including
   retractions and stricter framing around loadgen validity.
 The purpose is to converge on a defensible story and a concrete task plan,
 not to force the old Unified routing hypothesis to be true.
 ## 1. Current Best Framing
 ### 1.1 Workload premise
 Agentic serving is not chatbot serving.
 - Requests have long input contexts and short outputs.
 - Most reusable KV is intra-session, not cross-session.
 - Sessions are multi-turn and causally sequential: turn N+1 cannot be
  faithfully issued before turn N finishes.
 - Long-lived sessions create two competing needs:
  - keep cache locality for future turns;
  - avoid pinning all future work to an overloaded worker.
 This workload makes cache locality a first-order system objective, but
 also makes naive session pinning dangerous.
 ### 1.2 System premise
 PD separation is not a universally good abstraction. It helps only when:
 ```
 saved decode interference > KV transfer + P queue + D queue + KV capacity cost
 ```
 For agentic workloads, that inequality often fails because long-context KV
 is large and decode-side KV residency becomes the limiting resource.
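
The inequality can be read as a per-request budget check. The sketch below is only an illustration: the field names, the choice of milliseconds as a common unit, and every number are assumptions made for the example, not measurements from any of the three repos.

```python
from dataclasses import dataclass


@dataclass
class PdSepCosts:
    """Per-request cost terms in milliseconds (hypothetical units and names)."""
    saved_decode_interference_ms: float
    kv_transfer_ms: float
    p_queue_ms: float
    d_queue_ms: float
    kv_capacity_cost_ms: float  # latency-equivalent penalty for lost D-side KV headroom


def pd_sep_overhead_ms(c: PdSepCosts) -> float:
    # Right-hand side of the inequality: everything PD separation adds.
    return c.kv_transfer_ms + c.p_queue_ms + c.d_queue_ms + c.kv_capacity_cost_ms


def pd_sep_helps(c: PdSepCosts) -> bool:
    # PD separation pays off only if the saved interference exceeds the overhead.
    return c.saved_decode_interference_ms > pd_sep_overhead_ms(c)


# Illustrative numbers only: a short-context request vs. a long-context agentic turn.
short_ctx = PdSepCosts(40.0, 5.0, 3.0, 2.0, 0.0)
long_ctx = PdSepCosts(40.0, 60.0, 10.0, 15.0, 20.0)

print(pd_sep_helps(short_ctx), pd_sep_helps(long_ctx))
```

With these made-up inputs the saved interference is identical in both cases; only the transfer, queue, and capacity terms grow with context length, which is the failure mode described above.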
 ### 1.3 Main thesis candidate
 Static PD separation is the wrong default for single-node agentic serving.
 The stronger baseline is PD-colocation with cache-aware routing. The
 interesting open problem is not "separate prefill and decode everywhere",
 but:
 > how to preserve session-level KV locality while retaining enough routing
 > freedom to avoid hot-worker queueing and decode interference.
 Unified routing should be framed as an attempt at that problem. The current
 experiments show that the migration actuator was too expensive, so the
 story should distinguish the principle from the failed mechanism.
 ## 2. What We Should Align Across Repos
 ### 2.1 Naming / architecture mapping
 Use one taxonomy consistently:
 | Name in paper/story | vLLM repo term | SGLang/KVC repo term | Meaning |
 |---|---|---|---|
 | Replica / PD-colo | combined / PD-colocated | `pd_colo`, SGLang `cache_aware` | all workers do prefill + decode |
 | x=0 PD-disagg | full PD separation | `pd_disagg` | every turn goes P then D |
 | x=1 / append-prefill-on-D | not implemented as such in vLLM experiments | KVC / PPD-style direct-to-D | turn 1 seeds D; later turns prefill locally on D |
 | Elastic migration | Unified PUSH / elastic offload | smart migration / re-pin sessions | move a session or a request away from overloaded worker |
 | Hybrid routing | current Unified baseline | PD-colo + soft pin / kv-aware | cache-aware LB plus explicit affinity only when worth it |
 Important distinction: vLLM Unified PUSH migration is not the same as PPD
 x=1. Unified PUSH still pays cross-instance KV movement for migrated
 requests. PPD x=1 tries to avoid P-to-D transfer on later turns by doing
 append-prefill directly on the resident D node.
 ### 2.2 Results that look stable
 These are safe to build around:
 1. Full/static PD separation is weak for agentic workloads on a single node.
   - vLLM evidence: decode-side KV memory wall and transfer overhead.
   - SGLang evidence: x=0 PD-disagg is consistently worse than PD-colo.
 2. LMetric/cache-aware routing is a strong baseline.
   - In vLLM, corrected LMetric nearly matches session-sticky because
     `new_tokens = input - cached` creates implicit soft affinity.
   - In SGLang, `cache_aware` is the production-stable baseline and often
     wins or ties.
 3. Explicit session pinning is not automatically good.
   - It can improve cache hit rate.
   - It can create head-of-line blocking if sessions grow unevenly.
   - Initial placement matters; mid-session migration is desirable in
     principle but hard in practice.
 4. Transfer-based migration is currently too expensive.
   - vLLM experiments: forced/relaxed migration worsened E2E tail.
   - The Mooncake path gets too little overlap/layerwise benefit in the current setup.
 5. Loadgen validity must be treated as substrate, not detail.
   - `agentic-pd-hybrid` explicitly retracted high-concurrency claims
     where session sequentiality was violated.
   - Future experiments must enforce per-session causality.
 ### 2.3 Results that need more careful wording
 1. "Unified beats LMetric" should not be stated as a strong result yet.
   The latest stable implementation is closer to LMetric plus a high-cache
   affinity gate. Expected gain is small by design.
 2. "PD separation is always bad" is too broad.
   The correct claim is conditional: static/full PD separation is net
   negative in the long-context, high-KV-footprint, single-node agentic
   regime we measured.
 3. "KVC/PPD wins" is not established for our stack.
   The SGLang repo contains useful PPD framing, but also several retractions:
   high-concurrency wins were affected by loadgen issues, and KVC stability
   was not production-ready in some runs.
 4. "Session migration will fix load balance" is still a hypothesis.
   It is valid as a first-principles goal, but the tested vLLM actuator
   did not satisfy the cost budget.
 ## 3. Proposed Storytelling Outline
 ### Section A: Why agentic serving is different
 Claim:
 - Agentic workloads combine long contexts, high intra-session reuse, and
  sequential multi-turn sessions.
 - This makes KV cache lifecycle and routing more important than the classic
  prefill/decode kernel dichotomy.
 Evidence to use:
 - Input/output token CDF.
 - KV reuse decomposition: intra-session vs cross-session.
 - Session length / context growth examples.
 - Per-session sequentiality requirement.
 Needed cleanup:
 - Use one trace definition and report sampling method.
 - Explicitly state whether a trace is online-realistic, synthetic burst,
  or stress test.
 ### Section B: Why static PD separation fails
 Claim:
 - The classic roofline premise is true but insufficient.
 - Prefill can be compute-bound while static PD separation still loses at
  the system level.
 Mechanism:
 - PD separation relocates prefill; it does not reduce total prefill work.
 - It adds KV transfer.
 - It concentrates decode KV residency onto fewer D GPUs.
 - Long-context agentic requests hit a decode-side KV memory wall.
 Evidence to use:
 - `analysis/pd_sep_paper_section/system_analysis.md`
 - C1 workload figures.
 - C6 roofline figure.
 - KV memory wall model.
 - Fresh PD matrix once rerun without forced eager mode.
 Task implication:
 - Complete C2/C3/C4/C5 matrix before making this a paper-grade section.
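
The decode-side KV memory wall can be illustrated with a back-of-the-envelope capacity model. This is a rough sketch, not the model in `system_analysis.md`: the model shape (32 layers, 8 KV heads, head dimension 128, fp16), the 40 GiB per-GPU KV budget, and the 64K-token context are all hypothetical, and it assumes one worker per GPU with no session split across GPUs.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # K and V each store num_kv_heads * head_dim values per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_resident_sessions(kv_budget_bytes_per_gpu: int, num_kv_gpus: int,
                          context_tokens: int, bytes_per_token: int) -> int:
    # Sessions are counted per GPU because a session's KV is assumed not to span GPUs.
    per_session = context_tokens * bytes_per_token
    return num_kv_gpus * (kv_budget_bytes_per_gpu // per_session)


GIB = 2 ** 30
bpt = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)  # 131072 bytes

# GPUs that hold decode-resident KV: all 8 when colocated, only the D side otherwise.
for label, kv_gpus in [("PD-colo 8", 8), ("PD-sep 4P+4D", 4), ("PD-sep 6P+2D", 2)]:
    sessions = max_resident_sessions(40 * GIB, kv_gpus, 64 * 1024, bpt)
    print(f"{label}: {sessions} resident 64K-token sessions")
```

Under these assumptions each 64K-token session occupies 8 GiB of KV, so moving from 8 colocated workers to 4 or 2 D workers halves or quarters the number of sessions that can stay resident, independent of how fast prefill runs.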
 ### Section C: Why cache-aware PD-colo is hard to beat
 Claim:
 - Cache-aware routing already captures much of the desired session affinity.
 - LMetric's cache-adjusted prefill cost gives implicit soft affinity without
  hard pinning.
 Mechanism:
 - A worker with cached prefix has lower `new_tokens`.
 - This naturally attracts later turns unless the worker is sufficiently
  loaded.
 - This is exactly the balance we want: preserve locality while retaining
  routing freedom.
 Evidence to use:
 - Corrected LMetric vs Linear comparison.
 - APC distribution.
 - PD-colo stability from SGLang/KVC repo.
 Task implication:
 - Treat LMetric/cache-aware PD-colo as the primary baseline, not round-robin
  or naive sticky.
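
The implicit soft affinity can be sketched in a few lines. This is a minimal illustration, not the LMetric implementation: the `Worker` fields and the additive score (queued prefill tokens plus uncached tokens) are assumptions chosen for the example; only `new_tokens = input - cached` comes from this document.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    queued_tokens: int = 0                              # prefill tokens already waiting
    cached_prefix: dict = field(default_factory=dict)   # session_id -> cached prefix tokens


def new_tokens(worker: Worker, session_id: str, input_tokens: int) -> int:
    cached = min(worker.cached_prefix.get(session_id, 0), input_tokens)
    return input_tokens - cached


def pick_worker(workers, session_id: str, input_tokens: int) -> Worker:
    # Hypothetical score: pending prefill work plus this request's uncached tokens.
    return min(workers,
               key=lambda w: w.queued_tokens + new_tokens(w, session_id, input_tokens))


warm = Worker("w0", queued_tokens=2_000, cached_prefix={"s1": 30_000})
cold = Worker("w1", queued_tokens=0)

# Lightly loaded cache owner: 2000 + 2000 beats 0 + 32000, so the turn stays on w0.
print(pick_worker([warm, cold], "s1", 32_000).name)

# Heavily loaded cache owner: 40000 + 2000 loses to 0 + 32000, so the turn escapes to w1.
warm.queued_tokens = 40_000
print(pick_worker([warm, cold], "s1", 32_000).name)
```

No pin is stored anywhere in this sketch; the affinity comes entirely from the cache term in the score, which is why it weakens automatically as the owning worker's queue grows.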
 ### Section D: Why Unified migration did not improve over LMetric
 Claim:
 - Unified's principle was right, but the migration mechanism failed the
  cost budget.
 Mechanism:
 - At conservative gates, too few requests migrate to change load balance.
 - At relaxed gates, migration overhead dominates.
 - Cold/heavy requests often cannot benefit from source cache and remain
  colocated.
 - Cached migration still pays P-side queue, KV movement, and D admission.
 - The cost model initially underestimated cache-attraction feedback and
  queue effects.
 Evidence to use:
 - Git history: single argmin -> soft affinity -> decode load/hard gate ->
  forced migration -> revert -> hybrid LMetric.
 - Approach B / relaxed gate regressions.
 - 16-session contention: interference exists, but elastic RDMA made TPOT
  worse and offloaded too few requests.
 Task implication:
 - Do not revive three-way argmin or aggressive PUSH migration.
 - Frame current Unified as hybrid LMetric plus selective affinity.
 ### Section E: What remains promising
 There are two different future paths. They should not be conflated.
 Path 1: Conservative, vLLM-ready.
 - Stay PD-colocated.
 - Use corrected LMetric as base.
 - Add only explicit high-cache affinity / tie-break logic where it improves
  stability.
 - Improve scheduling: adaptive chunked prefill, decode-priority controls,
  better observability of queue and cache state.
 Path 2: Research, PPD-style.
 - Turn 1 seeds session on D.
 - Later turns do append-prefill on resident D, avoiding P-to-D transfer.
 - Dynamic x chooses P vs D based on append size, P queue, D load, and SLO.
 - Requires stable implementation and strict loadgen validation.
 The paper/story can say: transfer-based migration did not work;
 append-prefill-on-resident-D remains a different and potentially better
 actuator.
 ## 4. Design Direction Recommendation
 ### 4.1 Near-term path
 Use PD-colo cache-aware as the production baseline and paper baseline.
 Implement/validate only low-risk routing improvements:
 1. Pure LMetric baseline must stay separate and reproducible.
 2. Unified hybrid should be LMetric plus:
   - high-cache explicit affinity;
   - overload escape;
   - deterministic non-degenerate tie-break;
   - route-decision logging.
 3. No Mooncake/PUSH migration on the critical comparison path.
 This gives a clean statement:
 > The best robust single-node policy we have is cache-aware PD-colocation.
 > Unified hybrid is a small refinement, not a new disaggregation win.
 ### 4.2 Research path
 If we want a stronger contribution beyond "PD-sep loses", the promising
 research direction is:
 > session-resident append-prefill with dynamic P/D selection.
 This aligns better with PPD than vLLM PUSH migration does.
 Key design principle:
 - Do not move KV just to run prefill elsewhere unless the future benefit is
  large enough to amortize the transfer.
 - Prefer using the worker that already owns the session KV, unless decode
  load or append size makes that choice violate SLO.
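
One way to express this principle is a per-turn site selector that defaults to the resident D worker and leaves it only when the TTFT SLO would be missed and the P path is actually faster. The sketch below is hypothetical: the function name, the linear throughput model, and all parameters are assumptions, and it ignores the decode interference that append-prefill causes on D.

```python
def choose_prefill_site(append_tokens: int, transfer_tokens: int, *,
                        d_queue_ms: float, p_queue_ms: float,
                        d_tok_per_ms: float, p_tok_per_ms: float,
                        link_tok_per_ms: float, ttft_slo_ms: float):
    """Return ("D" or "P", estimated TTFT in ms) for one turn of a resident session."""
    # Append-prefill on the worker that already owns the session KV: no transfer.
    on_d = d_queue_ms + append_tokens / d_tok_per_ms
    # Prefill on P, then ship KV to D over the interconnect.
    on_p = p_queue_ms + append_tokens / p_tok_per_ms + transfer_tokens / link_tok_per_ms

    if on_d <= ttft_slo_ms:
        return "D", on_d          # resident path meets the SLO: do not move KV
    if on_p < on_d:
        return "P", on_p          # SLO at risk and the transfer is amortized
    return "D", on_d              # both miss; stay resident and avoid the transfer


# Small append on a moderately busy D worker stays resident.
print(choose_prefill_site(500, 500, d_queue_ms=20, p_queue_ms=5,
                          d_tok_per_ms=10, p_tok_per_ms=20,
                          link_tok_per_ms=50, ttft_slo_ms=200))

# Very large append exceeds the SLO on D and is cheaper on P even after transfer.
print(choose_prefill_site(20_000, 20_000, d_queue_ms=20, p_queue_ms=5,
                          d_tok_per_ms=10, p_tok_per_ms=20,
                          link_tok_per_ms=50, ttft_slo_ms=200))
```

The asymmetry is deliberate: the P path has to be better, not merely acceptable, before the session's KV leaves the worker that owns it.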
 ## 5. Experiment Plan
 ### 5.1 Must-have validity checks
 For every benchmark:
 - Per-session sequentiality enforced.
 - Attempted/completed/error counts reported.
 - Pair by `(session_id, turn_id)` when comparing arms.
 - Report goodput, not only latency of successes.
 - Record git commit, launch flags, trace path, request limit, time scale,
  session sampling method, and hardware.
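
The counting, pairing, and goodput checks could look roughly like the sketch below. The record schema (`session_id`, `turn_id`, `status`, `e2e_ms`) and the goodput definition (completed requests that meet an E2E SLO, per second of wall-clock time) are assumptions made for this example; the actual summary scripts may use different fields and a different SLO.

```python
def summarize(records, slo_ms: float, wall_clock_s: float) -> dict:
    """Report attempted/completed/error counts and goodput for one arm."""
    completed = [r for r in records if r["status"] == "ok"]
    within_slo = [r for r in completed if r["e2e_ms"] <= slo_ms]
    return {
        "attempted": len(records),
        "completed": len(completed),
        "errors": len(records) - len(completed),
        "goodput_rps": len(within_slo) / wall_clock_s,
    }


def paired_latencies(arm_a, arm_b):
    """Pair successful requests by (session_id, turn_id) across two arms."""
    def key(r):
        return (r["session_id"], r["turn_id"])

    a = {key(r): r for r in arm_a if r["status"] == "ok"}
    b = {key(r): r for r in arm_b if r["status"] == "ok"}
    common = sorted(a.keys() & b.keys())
    return [(k, a[k]["e2e_ms"], b[k]["e2e_ms"]) for k in common]


arm_a = [
    {"session_id": "s1", "turn_id": 0, "status": "ok", "e2e_ms": 800.0},
    {"session_id": "s1", "turn_id": 1, "status": "ok", "e2e_ms": 2500.0},
    {"session_id": "s2", "turn_id": 0, "status": "error", "e2e_ms": None},
]
arm_b = [
    {"session_id": "s1", "turn_id": 0, "status": "ok", "e2e_ms": 900.0},
    {"session_id": "s1", "turn_id": 1, "status": "error", "e2e_ms": None},
    {"session_id": "s2", "turn_id": 0, "status": "ok", "e2e_ms": 700.0},
]

print(summarize(arm_a, slo_ms=2000.0, wall_clock_s=10.0))
print(paired_latencies(arm_a, arm_b))
```

In this toy input only one `(session_id, turn_id)` pair succeeds in both arms, which is why paired latency deltas need to be reported next to the counts and goodput rather than instead of them.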
 ### 5.2 PD separation matrix
 Goal: make the static PD-sep negative result paper-grade.
 Arms:
 - PD-colo cache-aware.
 - PD-sep 4P+4D.
 - PD-sep 6P+2D.
 - Optional: round-robin baseline only as sanity, not main comparison.
 - Optional: eager vs CUDA graph ablation.
 Metrics:
 - TTFT/E2E/TPOT p50/p90/p99.
 - Goodput and error rate.
 - APC mean and per-instance distribution.
 - GPU util and decode-side KV occupancy time series.
 - TTFT breakdown: prefill, KV transfer, D wait.
 Output:
 - C2 headline bar with error bars.
 - C3 KV utilization time series.
 - C4 TTFT stacked breakdown.
 - C5 CUDA graph ablation.
 ### 5.3 LMetric vs Unified hybrid
 Goal: determine whether current Unified has any real gain over LMetric.
 Arms:
 - Pure corrected LMetric.
 - Current Unified hybrid.
 Run:
 - 3-5 paired trials on the same trace.
 - No Mooncake/PUSH.
 - Same launch flags.
 Additional logging:
 - Route reason: `lmetric`, `high_cache_affinity`, `overload_escape`,
  `tie_break`.
 - Chosen instance load, cache hit, effective new tokens.
 Decision rule:
 - If gain is within noise, do not oversell Unified as a performance win.
  Keep it as a policy cleanup / safety improvement.
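
The "within noise" decision can be made mechanical with a paired comparison across trials. The sketch below is one crude option and an assumption of this example, not an agreed methodology: it compares the mean per-trial difference against `z` standard errors, and with only 3-5 trials it has little statistical power. The latency values are invented.

```python
import statistics


def gain_within_noise(lmetric_vals, unified_vals, z: float = 2.0):
    """Paired per-trial comparison of one metric (e.g. E2E p99 in ms).

    Returns (mean_gain, stderr, within_noise); positive gain means Unified is faster.
    """
    if len(lmetric_vals) != len(unified_vals) or len(lmetric_vals) < 2:
        raise ValueError("need at least two paired trials of equal count")
    diffs = [l - u for l, u in zip(lmetric_vals, unified_vals)]
    mean_gain = statistics.mean(diffs)
    stderr = statistics.stdev(diffs) / len(diffs) ** 0.5
    return mean_gain, stderr, abs(mean_gain) <= z * stderr


# Invented example: four paired trials where the arms trade places from run to run.
mean_gain, stderr, noisy = gain_within_noise(
    [1000, 1040, 980, 1020],   # pure corrected LMetric
    [990, 1050, 985, 1000],    # current Unified hybrid
)
print(f"mean gain {mean_gain:.2f} ms, stderr {stderr:.2f} ms, within noise: {noisy}")
```

Pairing by trial removes run-to-run drift that both arms share, so the test is stricter than comparing two independent means.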
 ### 5.4 Interference and scheduler experiments
 Goal: test whether scheduling is the right actuator after routing saturates.
 Arms:
 - Different chunked prefill sizes.
 - Decode-priority / prefill throttling if available.
 - High-concurrency but session-sequential trace.
 Metrics:
 - TPOT under concurrent heavy prefills.
 - TTFT for heavy turns.
 - Decode queue delay.
 - GPU util timeline.
 Expected value:
 - If migration is too expensive, reducing prefill interference in-place is
  the most plausible next improvement.
 ### 5.5 PPD/KVC-style research validation
 Goal: separate PPD x=1 from failed x=0/full PD and failed transfer-based
 migration.
 Arms:
 - PD-colo cache-aware.
 - x=0 PD-disagg.
 - x=1 append-prefill-on-D if implementation is stable.
 - Dynamic x if available.
 Guardrails:
 - Do not use old high-concurrency KVC numbers without the loadgen caveat.
 - Do not compare partial successful subsets without goodput.
 - Treat SGLang implementation bugs as system results, not hidden noise.
 ## 6. Task Breakdown
 ### Track 1: Documentation alignment
 Owner task:
 - Update `REPORT.md`, `docs/migration-policy-design.md`, and
  `analysis/research_findings.md` so they use the taxonomy in Section 2.1.
 Concrete edits:
 - Mark single-argmin/PUSH Unified as historical.
 - State that current Unified is hybrid LMetric plus high-cache affinity.
 - Add mapping to PPD taxonomy: Replica, x=0 PD, x=1 append-prefill.
 - Add loadgen validity checklist.
 Done when:
 - A reviewer can no longer confuse vLLM PUSH migration with PPD x=1.
 - LMetric baseline and Unified hybrid are described as separate policies.
 ### Track 2: Current routing cleanup
 Owner task:
 - Make current Unified hybrid auditable and minimal.
 Concrete edits:
 - Remove stale unreachable PUSH code from `scripts/cache_aware_proxy.py`.
 - Keep pure `--policy lmetric` untouched.
 - Add route-decision fields for Unified hybrid.
 - Add tests:
  - pure LMetric remains pure;
  - high-cache affinity triggers only under its intended gate;
  - overload escape works;
  - empty-batch tie-break does not collapse to instance 0.
 Done when:
 - `pytest tests/test_proxy_pick.py` covers LMetric and Unified separately.
 - Bench logs can count how often Unified did something beyond LMetric.
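
To make the gates and test targets above concrete, here is a hypothetical sketch of a hybrid picker that returns a route reason with every decision. It is not the code in `scripts/cache_aware_proxy.py`: the `Worker` fields, the additive score, both thresholds, and the CRC-based tie-break are assumptions chosen for illustration.

```python
import zlib
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    queued_tokens: int = 0
    cached_prefix: dict = field(default_factory=dict)  # session_id -> cached tokens


def cached_tokens(w, session_id, input_tokens):
    return min(w.cached_prefix.get(session_id, 0), input_tokens)


def lmetric_score(w, session_id, input_tokens):
    # Hypothetical stand-in for LMetric: queued work plus uncached tokens.
    return w.queued_tokens + input_tokens - cached_tokens(w, session_id, input_tokens)


def route(workers, session_id, input_tokens,
          affinity_hit_ratio=0.9, overload_tokens=50_000):
    """Return (worker, reason) for one request."""
    owner = max(workers, key=lambda w: cached_tokens(w, session_id, input_tokens))
    hit_ratio = cached_tokens(owner, session_id, input_tokens) / max(input_tokens, 1)
    if hit_ratio >= affinity_hit_ratio:
        if owner.queued_tokens < overload_tokens:
            return owner, "high_cache_affinity"
        others = [w for w in workers if w is not owner] or [owner]
        best = min(others, key=lambda w: lmetric_score(w, session_id, input_tokens))
        return best, "overload_escape"

    scores = [lmetric_score(w, session_id, input_tokens) for w in workers]
    tied = [w for w, s in zip(workers, scores) if s == min(scores)]
    if len(tied) > 1:
        # crc32 is stable across processes, unlike the built-in str hash.
        return tied[zlib.crc32(session_id.encode("utf-8")) % len(tied)], "tie_break"
    return tied[0], "lmetric"


# Count how often the hybrid did something other than plain LMetric on an idle fleet.
fleet = [Worker(f"w{i}") for i in range(4)]
reasons = Counter(route(fleet, f"s-{i}", 1_000)[1] for i in range(64))
print(reasons)
```

Because the reason string travels with every decision, the same counter can be applied to bench logs to report how often the hybrid diverged from pure LMetric.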
 ### Track 3: PD-sep paper matrix
 Owner task:
 - Fill in the claims still missing from `analysis/pd_sep_paper_section`.
 Concrete work:
 - Run `bench_pd_matrix.sh` on dash0.
 - Collect `metrics.summary.json`, `breakdown.json`, `apc.txt`,
  `gpu_util.csv`, and per-instance KV logs.
 - Add plotters for C2/C3/C4/C5.
 - Replace legacy C7 numbers with matrix outputs.
 Done when:
 - The PD-sep negative result no longer relies on old `--enforce-eager`
  methodology or single snapshots.
 ### Track 4: Benchmark substrate validation
 Owner task:
 - Audit the vLLM replayer and any dash0 loadgen scripts for session
  sequentiality and arrival semantics.
 Concrete checks:
 - Verify no session has more than one in-flight turn unless explicitly
  configured as a stress test.
 - Add an analyzer that reports max concurrent turns per session.
 - Report sampled session-start distribution.
 - Add goodput and error-rate comparisons to all summary scripts.
 Done when:
 - We can label each experiment as online-realistic, burst stress, or
  synthetic microbench.
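
The per-session concurrency analyzer can be a simple sweep over request start and end times. The sketch below assumes each replayed request is logged with hypothetical `session_id`, `start_s`, and `end_s` fields; the real replayer output may need to be mapped onto this shape first.

```python
from collections import defaultdict


def max_concurrent_turns(records) -> dict:
    """Return {session_id: peak number of in-flight turns} from request intervals."""
    events = defaultdict(list)
    for r in records:
        events[r["session_id"]].append((r["start_s"], 1))
        events[r["session_id"]].append((r["end_s"], -1))

    peaks = {}
    for sid, evs in events.items():
        # Tuple sort puts an end (-1) before a start (+1) at the same timestamp,
        # so back-to-back turns are not counted as overlapping.
        evs.sort()
        current = peak = 0
        for _, delta in evs:
            current += delta
            peak = max(peak, current)
        peaks[sid] = peak
    return peaks


def sequentiality_violations(records) -> dict:
    # Any session with more than one in-flight turn breaks per-session causality.
    return {sid: n for sid, n in max_concurrent_turns(records).items() if n > 1}


records = [
    {"session_id": "a", "start_s": 0.0, "end_s": 5.0},
    {"session_id": "a", "start_s": 5.0, "end_s": 9.0},   # starts as the previous turn ends
    {"session_id": "b", "start_s": 0.0, "end_s": 5.0},
    {"session_id": "b", "start_s": 3.0, "end_s": 8.0},   # overlaps the previous turn
]
print(max_concurrent_turns(records))
print(sequentiality_violations(records))
```

An empty violations dictionary is a necessary condition for labeling a run online-realistic; a non-empty one means the run should be labeled a stress test or rerun.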
 ### Track 5: Scheduler/interference path
 Owner task:
 - Test whether in-place scheduling beats transfer-based migration.
 Concrete experiments:
 - Chunk size sweep.
 - Decode-priority or prefill-throttle sweep.
 - 16+ session sequential replay.
 Done when:
 - We know whether the next performance lever is scheduler policy or routing
  policy.
 ### Track 6: PPD-style appendix / related design
 Owner task:
 - Extract the useful `agentic-pd-hybrid` lessons without importing invalid
  claims.
 Concrete work:
 - Summarize:
  - loadgen bug and retractions;
  - PD-colo as stable baseline;
  - x=0 PD-disagg failure;
  - x=1/append-prefill motivation;
  - dynamic threshold lessons.
 - Decide whether this is mainline future work or an appendix framing.
 Done when:
 - The story can cite PPD-style append-prefill as a distinct future actuator,
  not as evidence that the current Unified migration already works.
 ## 7. Proposed One-Sentence Story
 Agentic serving breaks the classic PD-disaggregation intuition: long-lived
 sessions make KV locality dominant, while long contexts make decode-side KV
 capacity and transfer costs dominate the gains from isolating prefill; the
 robust design is cache-aware PD-colocation with carefully limited session
 affinity, and future disaggregation must be dynamic and session-resident
 rather than static or transfer-heavy.
 ## 8. Open Decisions For Review
 1. Do we want the main paper contribution to be the negative result
   "static PD separation fails for agentic workloads", or the positive system
   "cache-aware PD-colo / Unified hybrid"?
 2. Is PPD-style x=1 append-prefill a future-work section, or do we need to
   implement a minimal stable version before finalizing the story?
 3. Should current Unified be presented as a named system if its measured
   improvement over LMetric is small, or should it be framed as an audit of
   why LMetric/cache-aware is already strong?
 4. Which trace is the canonical trace for claims: the vLLM trace in
   `agentic-kv`, the GLM-5.1 trace in `agentic-pd-hybrid`, or both with
   explicit regime labels?
 5. What is the target venue-style claim: systems negative result,
   workload characterization, or routing/scheduling algorithm?