agentic-kvc/analysis/agentic_pd_unified_story_plan.md
Gahow Wang c63dc151a0 Agentic PD / Unified routing story plan draft
User's 2026-05-25 draft aligning three threads (agentic-kv vLLM
experiments, dash0 artifacts, agentic-pd-hybrid SGLang work) into
a single story for the paper. Tracked so future iterations and
review history are in version control.

Co-Authored-By: Gahow Wang <chiahaco@gmail.com>
2026-05-26 01:12:42 +08:00

Agentic PD / Unified Routing Story Plan

Status: draft for review
Date: 2026-05-25

0. Goal

This document aligns three threads:

  1. agentic-kv: vLLM-based PD-colocation, full PD separation, LMetric, Unified routing, and elastic migration experiments.
  2. dash0:/home/admin/cpfs/wjh/agentic-kv: run artifacts and the latest PD-separation paper-section scaffold.
  3. ~/phd/agentic-pd-hybrid: SGLang/PPD/KVC experiments, including retractions and stricter framing around loadgen validity.

The purpose is to converge on a defensible story and a concrete task plan, not to force the old Unified routing hypothesis to be true.

1. Current Best Framing

1.1 Workload premise

Agentic serving is not chatbot serving.

  • Requests have long input contexts and short outputs.
  • Most reusable KV is intra-session, not cross-session.
  • Sessions are multi-turn and causally sequential: turn N+1 cannot be faithfully issued before turn N finishes.
  • Long-lived sessions create two competing needs:
    • keep cache locality for future turns;
    • avoid pinning all future work to an overloaded worker.

This workload makes cache locality a first-order system objective, but also makes naive session pinning dangerous.

1.2 System premise

PD separation is not a universally good abstraction. It helps only when:

saved decode interference > KV transfer + P queue + D queue + KV capacity cost

For agentic workloads, that inequality often fails because long-context KV is large and decode-side KV residency becomes the limiting resource.
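
The inequality can be written as a small check. This is a minimal sketch, assuming every term has already been estimated as a per-request latency in milliseconds; the class name, field names, and units are hypothetical and do not come from either repo.

```python
from dataclasses import dataclass


@dataclass
class PDSepCosts:
    """Per-request estimates in milliseconds (hypothetical units and names)."""
    saved_decode_interference_ms: float
    kv_transfer_ms: float
    p_queue_ms: float
    d_queue_ms: float
    kv_capacity_cost_ms: float


def pd_separation_helps(c: PDSepCosts) -> bool:
    """True only if the saved interference strictly exceeds the added costs."""
    overhead = (
        c.kv_transfer_ms + c.p_queue_ms + c.d_queue_ms + c.kv_capacity_cost_ms
    )
    return c.saved_decode_interference_ms > overhead
```

The strict comparison means a break-even configuration counts as "does not help", which matches treating PD separation as something that must earn its complexity.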

1.3 Main thesis candidate

Static PD separation is the wrong default for single-node agentic serving. The stronger baseline is PD-colocation with cache-aware routing. The interesting open problem is not "separate prefill and decode everywhere", but:

how to preserve session-level KV locality while retaining enough routing freedom to avoid hot-worker queueing and decode interference.

Unified routing should be framed as an attempt at that problem. The current experiments show that the migration actuator was too expensive, so the story should distinguish the principle from the failed mechanism.

2. What We Should Align Across Repos

2.1 Naming / architecture mapping

Use one taxonomy consistently:

| Name in paper/story | vLLM repo term | SGLang/KVC repo term | Meaning |
| --- | --- | --- | --- |
| Replica / PD-colo | combined / PD-colocated | pd_colo, SGLang cache_aware | all workers do prefill + decode |
| x=0 PD-disagg | full PD separation | pd_disagg | every turn goes P then D |
| x=1 / append-prefill-on-D | not implemented as such in vLLM experiments | KVC / PPD-style direct-to-D | turn 1 seeds D; later turns prefill locally on D |
| Elastic migration | Unified PUSH / elastic offload | smart migration / re-pin sessions | move a session or a request away from overloaded worker |
| Hybrid routing | current Unified baseline | PD-colo + soft pin / kv-aware | cache-aware LB plus explicit affinity only when worth it |

Important distinction: vLLM Unified PUSH migration is not the same as PPD x=1. Unified PUSH still pays cross-instance KV movement for migrated requests. PPD x=1 tries to avoid P-to-D transfer on later turns by doing append-prefill directly on the resident D node.

2.2 Results that look stable

These are safe to build around:

  1. Full/static PD separation is weak for agentic on one node.

    • vLLM evidence: decode-side KV memory wall and transfer overhead.
    • SGLang evidence: x=0 PD-disagg is consistently worse than PD-colo.
  2. LMetric/cache-aware routing is a strong baseline.

    • In vLLM, corrected LMetric nearly matches session-sticky because new_tokens = input - cached creates implicit soft affinity.
    • In SGLang, cache_aware is the production-stable baseline and often wins or ties.
  3. Explicit session pinning is not automatically good.

    • It can improve cache hit rate.
    • It can create head-of-line blocking if sessions grow unevenly.
    • Initial placement matters; mid-session migration is desirable in principle but hard in practice.
  4. Transfer-based migration is currently too expensive.

    • vLLM experiments: forced/relaxed migration worsened E2E tail.
    • The Mooncake path does not yet deliver enough transfer/compute overlap or layerwise-transfer benefit in the current setup.
  5. Loadgen validity must be treated as substrate, not detail.

    • agentic-pd-hybrid explicitly retracted high-concurrency claims where session sequentiality was violated.
    • Future experiments must enforce per-session causality.

2.3 Results that need more careful wording

  1. "Unified beats LMetric" should not be stated as a strong result yet. The latest stable implementation is closer to LMetric plus a high-cache affinity gate. Expected gain is small by design.

  2. "PD separation is always bad" is too broad. The correct claim is conditional: static/full PD separation is net negative in the long-context, high-KV-footprint, single-node agentic regime we measured.

  3. "KVC/PPD wins" is not established for our stack. The SGLang repo contains useful PPD framing, but also several retractions: high-concurrency wins were affected by loadgen issues, and KVC stability was not production-ready in some runs.

  4. "Session migration will fix load balance" is still a hypothesis. It is valid as a first-principles goal, but the tested vLLM actuator did not satisfy the cost budget.

3. Proposed Storytelling Outline

Section A: Why agentic serving is different

Claim:

  • Agentic workloads combine long contexts, high intra-session reuse, and sequential multi-turn sessions.
  • This makes KV cache lifecycle and routing more important than the classic prefill/decode kernel dichotomy.

Evidence to use:

  • Input/output token CDF.
  • KV reuse decomposition: intra-session vs cross-session.
  • Session length / context growth examples.
  • Per-session sequentiality requirement.

Needed cleanup:

  • Use one trace definition and report sampling method.
  • Explicitly state whether a trace is online-realistic, synthetic burst, or stress test.

Section B: Why static PD separation fails

Claim:

  • The classic roofline premise is true but insufficient.
  • Prefill can be compute-bound while static PD separation still loses at the system level.

Mechanism:

  • PD separation relocates prefill; it does not reduce total prefill work.
  • It adds KV transfer.
  • It concentrates decode KV residency onto fewer D GPUs.
  • Long-context agentic requests hit a decode-side KV memory wall.
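
The memory-wall argument can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes a standard MHA/GQA KV layout (one K and one V vector per layer per KV head); the model shape, KV budget, and context length in the usage note are illustrative values, not measurements from our runs.

```python
def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2
) -> int:
    """KV cache bytes for one token; the factor 2 covers K and V."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_resident_sessions(
    kv_budget_bytes: int, context_tokens: int, bytes_per_token: int
) -> int:
    """How many full-context sessions fit in one GPU's KV budget."""
    return kv_budget_bytes // (context_tokens * bytes_per_token)
```

For a hypothetical 32-layer model with 8 KV heads of dimension 128 in fp16, one token costs 128 KiB of KV. With a 40 GiB KV budget and 64K-token contexts, `max_resident_sessions(40 * 2**30, 65536, 131072)` gives 5 sessions per GPU. Eight colocated GPUs could then hold about 40 such sessions, while a 4P+4D split leaves only the four D GPUs, about 20 sessions, to hold decode-side KV.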

Evidence to use:

  • analysis/pd_sep_paper_section/system_analysis.md
  • C1 workload figures.
  • C6 roofline figure.
  • KV memory wall model.
  • Fresh PD matrix once rerun without forced eager mode.

Task implication:

  • Complete C2/C3/C4/C5 matrix before making this a paper-grade section.

Section C: Why cache-aware PD-colo is hard to beat

Claim:

  • Cache-aware routing already captures much of the desired session affinity.
  • LMetric's cache-adjusted prefill cost gives implicit soft affinity without hard pinning.

Mechanism:

  • A worker with cached prefix has lower new_tokens.
  • This naturally attracts later turns unless the worker is sufficiently loaded.
  • This is exactly the balance we want: preserve locality while retaining routing freedom.
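
The soft-affinity effect can be illustrated with a toy scorer. This is a sketch of the idea only, not the LMetric implementation: the `Worker` fields and the additive score (new tokens plus pending tokens) are assumptions chosen to make the mechanism visible.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    cached_tokens: int   # prefix tokens of this request already cached here
    pending_tokens: int  # queued prefill work, used as a stand-in for load


def new_tokens(input_tokens: int, cached_tokens: int) -> int:
    """Tokens that still need prefill: new_tokens = input - cached."""
    return max(input_tokens - cached_tokens, 0)


def pick_worker(input_tokens: int, workers: list[Worker]) -> Worker:
    """Pick the worker with the lowest cache-adjusted cost (hypothetical score)."""
    return min(
        workers,
        key=lambda w: new_tokens(input_tokens, w.cached_tokens) + w.pending_tokens,
    )
```

With a 10,000-token request, a worker holding 9,000 cached tokens and 1,000 pending tokens scores 2,000 and beats an idle cold worker at 10,000. Once its pending work reaches 12,000 tokens the score rises to 13,000 and the request moves to the cold worker, so affinity gives way under load without any explicit pin.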

Evidence to use:

  • Corrected LMetric vs Linear comparison.
  • APC distribution.
  • PD-colo stability from SGLang/KVC repo.

Task implication:

  • Treat LMetric/cache-aware PD-colo as the primary baseline, not round-robin or naive sticky.

Section D: Why Unified migration did not improve over LMetric

Claim:

  • Unified's principle was right, but the migration mechanism failed the cost budget.

Mechanism:

  • At conservative gates, too few requests migrate to change load balance.
  • At relaxed gates, migration overhead dominates.
  • Cold/heavy requests often cannot benefit from source cache and remain colocated.
  • Cached migration still pays P-side queue, KV movement, and D admission.
  • The cost model initially underestimated cache-attraction feedback and queue effects.

Evidence to use:

  • Git history: single argmin -> soft affinity -> decode load/hard gate -> forced migration -> revert -> hybrid LMetric.
  • Approach B / relaxed gate regressions.
  • 16-session contention: interference exists, but elastic RDMA made TPOT worse and offloaded too few requests.

Task implication:

  • Do not revive three-way argmin or aggressive PUSH migration.
  • Frame current Unified as hybrid LMetric plus selective affinity.

Section E: What remains promising

There are two different future paths. They should not be conflated.

Path 1: Conservative, vLLM-ready.

  • Stay PD-colocated.
  • Use corrected LMetric as base.
  • Add only explicit high-cache affinity / tie-break logic where it improves stability.
  • Improve scheduling: adaptive chunked prefill, decode-priority controls, better observability of queue and cache state.

Path 2: Research, PPD-style.

  • Turn 1 seeds session on D.
  • Later turns do append-prefill on resident D, avoiding P-to-D transfer.
  • Dynamic x chooses P vs D based on append size, P queue, D load, and SLO.
  • Requires stable implementation and strict loadgen validation.
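
One possible shape for the dynamic x decision is sketched below. All thresholds, parameter names, and the fallback rule are hypothetical placeholders; the sketch also ignores prefill compute time on either side and only shows how append size, P queue, D load, and an SLO budget could interact.

```python
def choose_prefill_site(
    append_tokens: int,
    p_queue_ms: float,
    d_load: float,
    *,
    max_append_tokens: int = 2048,   # hypothetical: largest append D absorbs
    max_d_load: float = 0.8,         # hypothetical: D utilization ceiling
    est_transfer_ms: float = 200.0,  # hypothetical: P-to-D KV transfer estimate
    ttft_budget_ms: float = 2000.0,  # hypothetical: TTFT SLO budget
) -> str:
    """Return "D" for append-prefill on the resident D node, else "P"."""
    if append_tokens <= max_append_tokens and d_load <= max_d_load:
        return "D"  # small append on a healthy D: avoid the transfer entirely
    if p_queue_ms + est_transfer_ms <= ttft_budget_ms:
        return "P"  # P path still fits the budget
    return "D"      # P path would miss the SLO anyway; stay with resident KV
```

For example, `choose_prefill_site(512, 100.0, 0.3)` returns `"D"`, while an 8,192-token append with a short P queue returns `"P"`.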

The paper/story can say: transfer-based migration did not work; append-prefill-on-resident-D remains a different and potentially better actuator.

4. Design Direction Recommendation

4.1 Near-term path

Use PD-colo cache-aware as the production baseline and paper baseline.

Implement/validate only low-risk routing improvements:

  1. Pure LMetric baseline must stay separate and reproducible.
  2. Unified hybrid should be LMetric plus:
    • high-cache explicit affinity;
    • overload escape;
    • deterministic non-degenerate tie-break;
    • route-decision logging.
  3. No Mooncake/PUSH migration on the critical comparison path.
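
The four additions in item 2 can be sketched as a single routing function that also returns the reason for each decision. This is an illustration under stated assumptions, not the code in scripts/cache_aware_proxy.py: the `Instance` fields, the 0.9 hit-ratio gate, the overload threshold, and the CRC-based tie-break are all hypothetical.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    idx: int
    cached_tokens: int
    pending_tokens: int


def route(input_tokens, instances, session_id, *,
          affinity_hit_ratio=0.9, overload_tokens=50_000):
    """Return (instance index, route reason)."""
    def score(i):
        return max(input_tokens - i.cached_tokens, 0) + i.pending_tokens

    best_cache = max(instances, key=lambda i: i.cached_tokens)
    hit_ratio = best_cache.cached_tokens / input_tokens if input_tokens else 0.0
    if hit_ratio >= affinity_hit_ratio:
        if best_cache.pending_tokens < overload_tokens:
            return best_cache.idx, "high_cache_affinity"
        others = [i for i in instances if i.idx != best_cache.idx]
        if others:  # cache owner is overloaded: escape to the best alternative
            return min(others, key=score).idx, "overload_escape"

    best = min(score(i) for i in instances)
    tied = [i for i in instances if score(i) == best]
    if len(tied) > 1:
        # Deterministic per session, but does not always pick the first instance.
        pick = zlib.crc32(session_id.encode()) % len(tied)
        return tied[pick].idx, "tie_break"
    return tied[0].idx, "lmetric"
```

`zlib.crc32` is used instead of the built-in `hash()` because Python salts string hashes per process, which would make route logs non-reproducible across runs.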

This gives a clean statement:

The best robust single-node policy we have is cache-aware PD-colocation. Unified hybrid is a small refinement, not a new disaggregation win.

4.2 Research path

If we want a stronger contribution beyond "PD-sep loses", the promising research direction is:

session-resident append-prefill with dynamic P/D selection.

This aligns better with PPD than vLLM PUSH migration does.

Key design principle:

  • Do not move KV just to run prefill elsewhere unless the future benefit is large enough to amortize the transfer.
  • Prefer using the worker that already owns the session KV, unless decode load or append size makes that choice violate SLO.

5. Experiment Plan

5.1 Must-have validity checks

For every benchmark:

  • Per-session sequentiality enforced.
  • Attempted/completed/error counts reported.
  • Pair by (session_id, turn_id) when comparing arms.
  • Report goodput, not only latency of successes.
  • Record git commit, launch flags, trace path, request limit, time scale, session sampling method, and hardware.
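
The counting and pairing rules above can be sketched as follows. The record schema (`session_id`, `turn_id`, `ok`, `e2e_ms`) is a hypothetical stand-in for whatever the replayer emits, and goodput is taken here to mean successful completions per second of wall-clock run time; a stricter SLO-based definition would filter `ok` further.

```python
def summarize(records, duration_s):
    """Attempted/completed/error counts plus goodput for one arm."""
    attempted = len(records)
    completed = sum(1 for r in records if r["ok"])
    return {
        "attempted": attempted,
        "completed": completed,
        "errors": attempted - completed,
        "goodput_rps": completed / duration_s,
    }


def paired_latency_deltas(arm_a, arm_b):
    """E2E deltas (b - a) for (session_id, turn_id) keys that succeeded in both arms."""
    a = {(r["session_id"], r["turn_id"]): r for r in arm_a if r["ok"]}
    b = {(r["session_id"], r["turn_id"]): r for r in arm_b if r["ok"]}
    return {k: b[k]["e2e_ms"] - a[k]["e2e_ms"] for k in sorted(a.keys() & b.keys())}
```

Reporting the summary next to the paired deltas keeps a shrinking set of successes visible instead of letting it masquerade as a latency win.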

5.2 PD separation matrix

Goal: make the static PD-sep negative result paper-grade.

Arms:

  • PD-colo cache-aware.
  • PD-sep 4P+4D.
  • PD-sep 6P+2D.
  • Optional: round-robin baseline only as sanity, not main comparison.
  • Optional: eager vs cudagraph ablation.

Metrics:

  • TTFT/E2E/TPOT p50/p90/p99.
  • Goodput and error rate.
  • APC mean and per-instance distribution.
  • GPU util and decode-side KV occupancy time series.
  • TTFT breakdown: prefill, KV transfer, D wait.

Output:

  • C2 headline bar with error bars.
  • C3 KV utilization time series.
  • C4 TTFT stacked breakdown.
  • C5 cuda-graph ablation.

5.3 LMetric vs Unified hybrid

Goal: determine whether current Unified has any real gain over LMetric.

Arms:

  • Pure corrected LMetric.
  • Current Unified hybrid.

Run:

  • 3-5 paired trials on the same trace.
  • No Mooncake/PUSH.
  • Same launch flags.

Additional logging:

  • Route reason: lmetric, high_cache_affinity, overload_escape, tie_break.
  • Chosen instance load, cache hit, effective new tokens.

Decision rule:

  • If gain is within noise, do not oversell Unified as a performance win. Keep it as a policy cleanup / safety improvement.
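
A simple way to apply the decision rule to 3-5 paired trials is to compare the mean paired difference against its standard error. This is a rough sketch, not a substitute for a proper significance test at such small sample sizes; the function name and the multiplier `k` are hypothetical.

```python
import math
import statistics


def paired_gain(lmetric_vals, unified_vals, k=2.0):
    """Compare per-trial metrics (e.g. p99 E2E in ms); positive gain favors Unified."""
    if len(lmetric_vals) != len(unified_vals) or len(lmetric_vals) < 2:
        raise ValueError("need at least two paired trials of equal length")
    deltas = [l - u for l, u in zip(lmetric_vals, unified_vals)]
    mean = statistics.mean(deltas)
    stderr = statistics.stdev(deltas) / math.sqrt(len(deltas))
    return {
        "mean_gain": mean,
        "stderr": stderr,
        "within_noise": abs(mean) <= k * stderr,
    }
```

If `within_noise` is true, the honest summary is that Unified hybrid matches LMetric on this trace.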

5.4 Interference and scheduler experiments

Goal: test whether scheduling is the right actuator after routing saturates.

Arms:

  • Different chunked prefill sizes.
  • Decode-priority / prefill throttling if available.
  • High-concurrency but session-sequential trace.

Metrics:

  • TPOT under concurrent heavy prefills.
  • TTFT for heavy turns.
  • Decode queue delay.
  • GPU util timeline.

Expected value:

  • If migration is too expensive, reducing prefill interference in-place is the most plausible next improvement.

5.5 PPD/KVC-style research validation

Goal: separate PPD x=1 from failed x=0/full PD and failed transfer-based migration.

Arms:

  • PD-colo cache-aware.
  • x=0 PD-disagg.
  • x=1 append-prefill-on-D if implementation is stable.
  • Dynamic x if available.

Guardrails:

  • Do not use old high-concurrency KVC numbers without the loadgen caveat.
  • Do not compare partial successful subsets without goodput.
  • Treat SGLang implementation bugs as system results, not hidden noise.

6. Task Breakdown

Track 1: Documentation alignment

Owner task:

  • Update REPORT.md, docs/migration-policy-design.md, and analysis/research_findings.md so they use the taxonomy in section 2.

Concrete edits:

  • Mark single-argmin/PUSH Unified as historical.
  • State that current Unified is hybrid LMetric plus high-cache affinity.
  • Add mapping to PPD taxonomy: Replica, x=0 PD, x=1 append-prefill.
  • Add loadgen validity checklist.

Done when:

  • A reviewer can no longer confuse vLLM PUSH migration with PPD x=1.
  • LMetric baseline and Unified hybrid are described as separate policies.

Track 2: Current routing cleanup

Owner task:

  • Make current Unified hybrid auditable and minimal.

Concrete edits:

  • Remove stale unreachable PUSH code from scripts/cache_aware_proxy.py.
  • Keep pure --policy lmetric untouched.
  • Add route-decision fields for Unified hybrid.
  • Add tests:
    • pure LMetric remains pure;
    • high-cache affinity triggers only under its intended gate;
    • overload escape works;
    • empty-batch tie-break does not collapse to instance 0.

Done when:

  • pytest tests/test_proxy_pick.py covers LMetric and Unified separately.
  • Bench logs can count how often Unified did something beyond LMetric.

Track 3: PD-sep paper matrix

Owner task:

  • Fill in the claims still missing from analysis/pd_sep_paper_section.

Concrete work:

  • Run bench_pd_matrix.sh on dash0.
  • Collect metrics.summary.json, breakdown.json, apc.txt, gpu_util.csv, and per-instance KV logs.
  • Add plotters for C2/C3/C4/C5.
  • Replace legacy C7 numbers with matrix outputs.

Done when:

  • The PD-sep negative result no longer relies on old --enforce-eager methodology or single snapshots.

Track 4: Benchmark substrate validation

Owner task:

  • Audit the vLLM replayer and any dash0 loadgen scripts for session sequentiality and arrival semantics.

Concrete checks:

  • Verify no session has more than one in-flight turn unless explicitly configured as a stress test.
  • Add an analyzer that reports max concurrent turns per session.
  • Report sampled session-start distribution.
  • Add goodput and error-rate comparisons to all summary scripts.
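
The per-session concurrency analyzer could look like the sketch below. It assumes each request is available as a `(session_id, start_ts, end_ts)` tuple taken from replayer logs; that input shape and the function name are hypothetical.

```python
from collections import defaultdict


def max_concurrent_turns(requests):
    """Return {session_id: peak number of in-flight turns}."""
    events = defaultdict(list)
    for session_id, start_ts, end_ts in requests:
        events[session_id].append((start_ts, 1))
        events[session_id].append((end_ts, -1))

    peaks = {}
    for session_id, evs in events.items():
        # Tuple sort puts (t, -1) before (t, +1), so a turn that starts exactly
        # when the previous one ends is not counted as overlapping.
        evs.sort()
        current = peak = 0
        for _, delta in evs:
            current += delta
            peak = max(peak, current)
        peaks[session_id] = peak
    return peaks
```

Any session with a peak above 1 violates sequentiality, and the run should be labeled a stress test rather than online-realistic.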

Done when:

  • We can label each experiment as online-realistic, burst stress, or synthetic microbench.

Track 5: Scheduler/interference path

Owner task:

  • Test whether in-place scheduling beats transfer-based migration.

Concrete experiments:

  • Chunk size sweep.
  • Decode-priority or prefill-throttle sweep.
  • 16+ session sequential replay.

Done when:

  • We know whether the next performance lever is scheduler policy or routing policy.

Track 6: PPD/KVC lessons from agentic-pd-hybrid

Owner task:

  • Extract the useful agentic-pd-hybrid lessons without importing invalid claims.

Concrete work:

  • Summarize:
    • loadgen bug and retractions;
    • PD-colo as stable baseline;
    • x=0 PD-disagg failure;
    • x=1/append-prefill motivation;
    • dynamic threshold lessons.
  • Decide whether this is mainline future work or an appendix framing.

Done when:

  • The story can cite PPD-style append-prefill as a distinct future actuator, not as evidence that the current Unified migration already works.

7. Proposed One-Sentence Story

Agentic serving breaks the classic PD-disaggregation intuition: long-lived sessions make KV locality dominant, while long contexts make decode-side KV capacity and transfer costs dominate the gains from isolating prefill; the robust design is cache-aware PD-colocation with carefully limited session affinity, and future disaggregation must be dynamic and session-resident rather than static or transfer-heavy.

8. Open Decisions For Review

  1. Do we want the main paper contribution to be the negative result "static PD separation fails for agentic", or the positive system "cache-aware PD-colo / Unified hybrid"?

  2. Is PPD-style x=1 append-prefill a future-work section, or do we need to implement a minimal stable version before finalizing the story?

  3. Should current Unified be presented as a named system if its measured improvement over LMetric is small, or should it be framed as an audit of why LMetric/cache-aware is already strong?

  4. Which trace is the canonical trace for claims: the vLLM trace in agentic-kv, the GLM-5.1 trace in agentic-pd-hybrid, or both with explicit regime labels?

  5. What is the target venue-style claim: systems negative result, workload characterization, or routing/scheduling algorithm?