Agentic PD / Unified routing story plan draft
User's 2026-05-25 draft aligning three threads (agentic-kv vLLM experiments, dash0 artifacts, agentic-pd-hybrid SGLang work) into a single story for the paper. Tracked so future iterations and review history are in version control. Co-Authored-By: Gahow Wang <chiahaco@gmail.com>
analysis/agentic_pd_unified_story_plan.md
# Agentic PD / Unified Routing Story Plan

Status: draft for review
Date: 2026-05-25

## 0. Goal

This document aligns three threads:

1. `agentic-kv`: vLLM-based PD-colocation, full PD separation, LMetric,
   Unified routing, and elastic migration experiments.
2. `dash0:/home/admin/cpfs/wjh/agentic-kv`: run artifacts and the
   latest PD-separation paper-section scaffold.
3. `~/phd/agentic-pd-hybrid`: SGLang/PPD/KVC experiments, including
   retractions and stricter framing around loadgen validity.

The purpose is to converge on a defensible story and a concrete task plan,
not to force the old Unified routing hypothesis to be true.

## 1. Current Best Framing

### 1.1 Workload premise

Agentic serving is not chatbot serving.

- Requests have long input contexts and short outputs.
- Most reusable KV is intra-session, not cross-session.
- Sessions are multi-turn and causally sequential: turn N+1 cannot be
  faithfully issued before turn N finishes.
- Long-lived sessions create two competing needs:
  - keep cache locality for future turns;
  - avoid pinning all future work to an overloaded worker.

This workload makes cache locality a first-order system objective, but
also makes naive session pinning dangerous.

### 1.2 System premise

PD separation is not a universally good abstraction. It helps only when:

```
saved decode interference > KV transfer + P queue + D queue + KV capacity cost
```

For agentic workloads, that inequality often fails because long-context KV
is large and decode-side KV residency becomes the limiting resource.
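
One way to make the inequality operational is a per-request break-even
check. The sketch below assumes every term can be estimated in milliseconds
per request; the class name, the field names, and the idea of folding KV
capacity pressure into a latency-equivalent cost are illustrative
assumptions, not code from either repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PdCostEstimate:
    """Per-request latency terms in milliseconds (hypothetical inputs)."""

    saved_decode_interference_ms: float
    kv_transfer_ms: float
    p_queue_ms: float
    d_queue_ms: float
    kv_capacity_cost_ms: float  # e.g. extra wait caused by D-side KV pressure

    def separation_overhead_ms(self) -> float:
        # Right-hand side of the inequality in section 1.2.
        return (self.kv_transfer_ms + self.p_queue_ms
                + self.d_queue_ms + self.kv_capacity_cost_ms)

    def pd_separation_helps(self) -> bool:
        # Left-hand side must strictly exceed the summed overheads.
        return self.saved_decode_interference_ms > self.separation_overhead_ms()
```

With made-up numbers, `PdCostEstimate(40.0, 25.0, 10.0, 10.0, 5.0)` reports
an overhead of 50 ms against 40 ms saved, so separation would not pay off
for that request.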

### 1.3 Main thesis candidate

Static PD separation is the wrong default for single-node agentic serving.
The stronger baseline is PD-colocation with cache-aware routing. The
interesting open problem is not "separate prefill and decode everywhere",
but:

> how to preserve session-level KV locality while retaining enough routing
> freedom to avoid hot-worker queueing and decode interference.

Unified routing should be framed as an attempt at that problem. The current
experiments show that the migration actuator was too expensive, so the
story should distinguish the principle from the failed mechanism.

## 2. What We Should Align Across Repos

### 2.1 Naming / architecture mapping

Use one taxonomy consistently:

| Name in paper/story | vLLM repo term | SGLang/KVC repo term | Meaning |
|---|---|---|---|
| Replica / PD-colo | combined / PD-colocated | `pd_colo`, SGLang `cache_aware` | all workers do prefill + decode |
| x=0 PD-disagg | full PD separation | `pd_disagg` | every turn goes P then D |
| x=1 / append-prefill-on-D | not implemented as such in vLLM experiments | KVC / PPD-style direct-to-D | turn 1 seeds D; later turns prefill locally on D |
| Elastic migration | Unified PUSH / elastic offload | smart migration / re-pin sessions | move a session or a request away from overloaded worker |
| Hybrid routing | current Unified baseline | PD-colo + soft pin / kv-aware | cache-aware LB plus explicit affinity only when worth it |

Important distinction: vLLM Unified PUSH migration is not the same as PPD
x=1. Unified PUSH still pays cross-instance KV movement for migrated
requests. PPD x=1 tries to avoid P-to-D transfer on later turns by doing
append-prefill directly on the resident D node.

### 2.2 Results that look stable

These are safe to build around:

1. Full/static PD separation is weak for agentic on one node.
   - vLLM evidence: decode-side KV memory wall and transfer overhead.
   - SGLang evidence: x=0 PD-disagg is consistently worse than PD-colo.

2. LMetric/cache-aware routing is a strong baseline.
   - In vLLM, corrected LMetric nearly matches session-sticky because
     `new_tokens = input - cached` creates implicit soft affinity.
   - In SGLang, `cache_aware` is the production-stable baseline and often
     wins or ties.

3. Explicit session pinning is not automatically good.
   - It can improve cache hit rate.
   - It can create head-of-line blocking if sessions grow unevenly.
   - Initial placement matters; mid-session migration is desirable in
     principle but hard in practice.

4. Transfer-based migration is currently too expensive.
   - vLLM experiments: forced/relaxed migration worsened E2E tail.
   - Mooncake path lacks enough overlap/layerwise benefit in current setup.

5. Loadgen validity must be treated as substrate, not detail.
   - `agentic-pd-hybrid` explicitly retracted high-concurrency claims
     where session sequentiality was violated.
   - Future experiments must enforce per-session causality.

### 2.3 Results that need more careful wording

1. "Unified beats LMetric" should not be stated as a strong result yet.
   The latest stable implementation is closer to LMetric plus a high-cache
   affinity gate. Expected gain is small by design.

2. "PD separation is always bad" is too broad.
   The correct claim is conditional: static/full PD separation is net
   negative in the long-context, high-KV-footprint, single-node agentic
   regime we measured.

3. "KVC/PPD wins" is not established for our stack.
   The SGLang repo contains useful PPD framing, but also several retractions:
   high-concurrency wins were affected by loadgen issues, and KVC stability
   was not production-ready in some runs.

4. "Session migration will fix load balance" is still a hypothesis.
   It is valid as a first-principles goal, but the tested vLLM actuator
   did not satisfy the cost budget.

## 3. Proposed Storytelling Outline

### Section A: Why agentic serving is different

Claim:

- Agentic workloads combine long contexts, high intra-session reuse, and
  sequential multi-turn sessions.
- This makes KV cache lifecycle and routing more important than the classic
  prefill/decode kernel dichotomy.

Evidence to use:

- Input/output token CDF.
- KV reuse decomposition: intra-session vs cross-session.
- Session length / context growth examples.
- Per-session sequentiality requirement.

Needed cleanup:

- Use one trace definition and report sampling method.
- Explicitly state whether a trace is online-realistic, synthetic burst,
  or stress test.

### Section B: Why static PD separation fails

Claim:

- The classic roofline premise is true but insufficient.
- Prefill can be compute-bound while static PD separation still loses at
  the system level.

Mechanism:

- PD separation relocates prefill; it does not reduce total prefill work.
- It adds KV transfer.
- It concentrates decode KV residency onto fewer D GPUs.
- Long-context agentic requests hit a decode-side KV memory wall.

Evidence to use:

- `analysis/pd_sep_paper_section/system_analysis.md`
- C1 workload figures.
- C6 roofline figure.
- KV memory wall model.
- Fresh PD matrix once rerun without forced eager mode.

Task implication:

- Complete C2/C3/C4/C5 matrix before making this a paper-grade section.
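
The "concentrates decode KV residency onto fewer D GPUs" mechanism can be
illustrated with back-of-envelope arithmetic. The sketch below is not the
KV memory wall model from `analysis/pd_sep_paper_section`; it assumes
standard MHA/GQA attention (one K and one V tensor per layer), whole
sessions packed per GPU, and ignores paging granularity and any prefix
cache held on P workers. The model shape and KV budget in the usage note
are hypothetical.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV-cache bytes per token for MHA/GQA attention (K plus V, all layers)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def resident_sessions(kv_budget_bytes_per_gpu: int, num_gpus: int,
                      context_tokens: int, bytes_per_token: int) -> int:
    """Full-context sessions that fit when each session lives on one GPU."""
    per_session = context_tokens * bytes_per_token
    return num_gpus * (kv_budget_bytes_per_gpu // per_session)


GiB = 1024 ** 3
bpt = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
colo_8 = resident_sessions(40 * GiB, 8, 65536, bpt)    # all 8 GPUs hold decode KV
sep_4p4d = resident_sessions(40 * GiB, 4, 65536, bpt)  # only the 4 D GPUs do
sep_6p2d = resident_sessions(40 * GiB, 2, 65536, bpt)  # only the 2 D GPUs do
```

For this hypothetical shape (32 layers, 8 KV heads, head dim 128, fp16,
64K-token sessions, 40 GiB of KV budget per GPU) the capacity drops from 40
resident sessions with 8 colocated GPUs to 20 with 4P+4D and 10 with 6P+2D.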

### Section C: Why cache-aware PD-colo is hard to beat

Claim:

- Cache-aware routing already captures much of the desired session affinity.
- LMetric's cache-adjusted prefill cost gives implicit soft affinity without
  hard pinning.

Mechanism:

- A worker with cached prefix has lower `new_tokens`.
- This naturally attracts later turns unless the worker is sufficiently
  loaded.
- This is exactly the balance we want: preserve locality while retaining
  routing freedom.

Evidence to use:

- Corrected LMetric vs Linear comparison.
- APC distribution.
- PD-colo stability from SGLang/KVC repo.

Task implication:

- Treat LMetric/cache-aware PD-colo as the primary baseline, not round-robin
  or naive sticky.
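
The soft-affinity mechanism can be shown with a toy scorer. Only
`new_tokens = input - cached` comes from the text above; the additive
`queued_tokens` load term, the `WorkerState` type, and the argmin rule are
assumptions made for illustration and are not the actual LMetric
implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerState:
    name: str
    queued_tokens: int   # hypothetical load signal: tokens already waiting
    cached_tokens: int   # prefix tokens of this request cached on the worker


def new_tokens(input_tokens: int, cached_tokens: int) -> int:
    # Cache-adjusted prefill cost: only uncached tokens must be computed.
    return max(input_tokens - cached_tokens, 0)


def pick_worker(input_tokens: int, workers: list[WorkerState]) -> WorkerState:
    # Lower score wins; the name is only a deterministic secondary key.
    return min(
        workers,
        key=lambda w: (w.queued_tokens + new_tokens(input_tokens, w.cached_tokens),
                       w.name),
    )
```

A worker holding 9000 of 10000 input tokens wins while its queue is short,
and loses to an idle cold worker once its queue grows past the cache
advantage. That is the "attracts later turns unless sufficiently loaded"
behaviour without any hard pin.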

### Section D: Why Unified migration did not improve over LMetric

Claim:

- Unified's principle was right, but the migration mechanism failed the
  cost budget.

Mechanism:

- At conservative gates, too few requests migrate to change load balance.
- At relaxed gates, migration overhead dominates.
- Cold/heavy requests often cannot benefit from source cache and remain
  colocated.
- Cached migration still pays P-side queue, KV movement, and D admission.
- The cost model initially underestimated cache-attraction feedback and
  queue effects.

Evidence to use:

- Git history: single argmin -> soft affinity -> decode load/hard gate ->
  forced migration -> revert -> hybrid LMetric.
- Approach B / relaxed gate regressions.
- 16-session contention: interference exists, but elastic RDMA made TPOT
  worse and offloaded too few requests.

Task implication:

- Do not revive three-way argmin or aggressive PUSH migration.
- Frame current Unified as hybrid LMetric plus selective affinity.

### Section E: What remains promising

There are two different future paths. They should not be conflated.

Path 1: Conservative, vLLM-ready.

- Stay PD-colocated.
- Use corrected LMetric as base.
- Add only explicit high-cache affinity / tie-break logic where it improves
  stability.
- Improve scheduling: adaptive chunked prefill, decode-priority controls,
  better observability of queue and cache state.

Path 2: Research, PPD-style.

- Turn 1 seeds session on D.
- Later turns do append-prefill on resident D, avoiding P-to-D transfer.
- Dynamic x chooses P vs D based on append size, P queue, D load, and SLO.
- Requires stable implementation and strict loadgen validation.

The paper/story can say: transfer-based migration did not work;
append-prefill-on-resident-D remains a different and potentially better
actuator.

## 4. Design Direction Recommendation

### 4.1 Near-term path

Use PD-colo cache-aware as the production baseline and paper baseline.

Implement/validate only low-risk routing improvements:

1. Pure LMetric baseline must stay separate and reproducible.
2. Unified hybrid should be LMetric plus:
   - high-cache explicit affinity;
   - overload escape;
   - deterministic non-degenerate tie-break;
   - route-decision logging.
3. No Mooncake/PUSH migration on the critical comparison path.

This gives a clean statement:

> The best robust single-node policy we have is cache-aware PD-colocation.
> Unified hybrid is a small refinement, not a new disaggregation win.
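
The four Unified hybrid components listed in this section can be sketched
as one routing function that always returns a reason alongside its choice.
This is a minimal illustration, not the logic in
`scripts/cache_aware_proxy.py`: the `Worker` and `RouteDecision` types, the
`queued_tokens` load signal, the 0.8 cache-ratio gate, the 50,000-token
overload threshold, and the hash-based tie-break are all assumptions.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Worker:
    idx: int
    queued_tokens: int   # hypothetical load signal
    cached_tokens: int   # prefix tokens of this request cached on the worker


@dataclass(frozen=True)
class RouteDecision:
    worker_idx: int
    reason: str          # lmetric | high_cache_affinity | overload_escape | tie_break
    chosen_load: int
    cache_hit: float
    effective_new_tokens: int


def _new_tokens(w: Worker, input_tokens: int) -> int:
    return max(input_tokens - w.cached_tokens, 0)


def _score(w: Worker, input_tokens: int) -> int:
    return w.queued_tokens + _new_tokens(w, input_tokens)


def _argmin(request_id: str, input_tokens: int,
            candidates: list[Worker]) -> tuple[Worker, bool]:
    """Return (best worker, whether a tie had to be broken)."""
    best = min(_score(w, input_tokens) for w in candidates)
    tied = sorted((w for w in candidates if _score(w, input_tokens) == best),
                  key=lambda w: w.idx)
    if len(tied) == 1:
        return tied[0], False
    # Deterministic per request, but spread across workers instead of idx 0.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return tied[int.from_bytes(digest[:8], "big") % len(tied)], True


def route(request_id: str, input_tokens: int, workers: list[Worker],
          high_cache_ratio: float = 0.8,
          overload_tokens: int = 50_000) -> RouteDecision:
    def decide(w: Worker, reason: str) -> RouteDecision:
        return RouteDecision(w.idx, reason, w.queued_tokens,
                             w.cached_tokens / max(input_tokens, 1),
                             _new_tokens(w, input_tokens))

    owner = max(workers, key=lambda w: (w.cached_tokens, -w.idx))
    if owner.cached_tokens / max(input_tokens, 1) >= high_cache_ratio:
        if owner.queued_tokens <= overload_tokens:
            return decide(owner, "high_cache_affinity")
        # The cache owner is overloaded: fall back to scoring the others.
        others = [w for w in workers if w.idx != owner.idx] or workers
        chosen, _ = _argmin(request_id, input_tokens, others)
        return decide(chosen, "overload_escape")

    chosen, tied = _argmin(request_id, input_tokens, workers)
    return decide(chosen, "tie_break" if tied else "lmetric")
```

Logging the returned `RouteDecision` per request makes it possible to count
how often the hybrid deviates from plain LMetric.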

### 4.2 Research path

If we want a stronger contribution beyond "PD-sep loses", the promising
research direction is:

> session-resident append-prefill with dynamic P/D selection.

This aligns better with PPD than vLLM PUSH migration does.

Key design principle:

- Do not move KV just to run prefill elsewhere unless the future benefit is
  large enough to amortize the transfer.
- Prefer using the worker that already owns the session KV, unless decode
  load or append size makes that choice violate SLO.
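
Both bullets of the design principle reduce to simple comparisons. The
sketch below is one hedged way to write them down; the function names, the
linear transfer and prefill cost models, and every parameter are
assumptions for illustration, not a tested dynamic-x policy.

```python
def transfer_is_amortized(kv_bytes: float, link_bytes_per_s: float,
                          expected_future_turns: int,
                          saved_ms_per_turn: float,
                          fixed_overhead_ms: float = 0.0) -> bool:
    """Move KV only if expected future savings exceed the one-time transfer."""
    transfer_ms = kv_bytes / link_bytes_per_s * 1000.0 + fixed_overhead_ms
    return expected_future_turns * saved_ms_per_turn > transfer_ms


def choose_executor(append_tokens: int, resident_prefill_tok_per_s: float,
                    resident_queue_ms: float, ttft_slo_ms: float) -> str:
    """Prefer the worker that owns the session KV unless it would break SLO."""
    est_ttft_ms = (resident_queue_ms
                   + append_tokens / resident_prefill_tok_per_s * 1000.0)
    return "resident_d" if est_ttft_ms <= ttft_slo_ms else "prefill_worker"
```

With made-up inputs, moving 8 GB of KV over a 25 GB/s link costs about
320 ms, so three future turns that each save 50 ms do not amortize it while
ten do.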

## 5. Experiment Plan

### 5.1 Must-have validity checks

For every benchmark:

- Per-session sequentiality enforced.
- Attempted/completed/error counts reported.
- Pair by `(session_id, turn_id)` when comparing arms.
- Report goodput, not only latency of successes.
- Record git commit, launch flags, trace path, request limit, time scale,
  session sampling method, and hardware.
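
The counting, pairing, and goodput checks above can be sketched over
per-request records. The record schema (`session_id`, `turn_id`, `ok`,
`e2e_ms`) and the SLO-based goodput definition are assumptions; the real
replayer output may use different field names.

```python
def outcome_counts(records: list[dict]) -> dict[str, int]:
    """Attempted/completed/error counts for one arm."""
    attempted = len(records)
    completed = sum(1 for r in records if r["ok"])
    return {"attempted": attempted, "completed": completed,
            "errors": attempted - completed}


def pair_by_turn(arm_a: list[dict], arm_b: list[dict]) -> list[tuple[dict, dict]]:
    """Pair records by (session_id, turn_id); keep pairs where both succeeded."""
    index_b = {(r["session_id"], r["turn_id"]): r for r in arm_b}
    pairs = []
    for a in arm_a:
        b = index_b.get((a["session_id"], a["turn_id"]))
        if b is not None and a["ok"] and b["ok"]:
            pairs.append((a, b))
    return pairs


def goodput_rps(records: list[dict], wall_seconds: float, slo_ms: float) -> float:
    """Requests per second that both succeeded and met the E2E SLO."""
    good = sum(1 for r in records if r["ok"] and r["e2e_ms"] <= slo_ms)
    return good / wall_seconds
```

Reporting `outcome_counts` next to any paired latency comparison keeps an
arm from looking faster only because its slow requests failed.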

### 5.2 PD separation matrix

Goal: make the static PD-sep negative result paper-grade.

Arms:

- PD-colo cache-aware.
- PD-sep 4P+4D.
- PD-sep 6P+2D.
- Optional: round-robin baseline only as sanity, not main comparison.
- Optional: eager vs cudagraph ablation.

Metrics:

- TTFT/E2E/TPOT p50/p90/p99.
- Goodput and error rate.
- APC mean and per-instance distribution.
- GPU util and decode-side KV occupancy time series.
- TTFT breakdown: prefill, KV transfer, D wait.

Output:

- C2 headline bar with error bars.
- C3 KV utilization time series.
- C4 TTFT stacked breakdown.
- C5 cuda-graph ablation.
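
For the p50/p90/p99 metrics, a small helper keeps the percentile definition
identical across arms. This sketch uses the standard library's
`statistics.quantiles` with the inclusive method; the function name and the
choice of interpolation method are assumptions, and the existing summary
scripts may compute percentiles differently.

```python
import statistics


def latency_summary(samples_ms: list[float]) -> dict[str, float]:
    """p50/p90/p99 of one latency metric (TTFT, E2E, or TPOT)."""
    if not samples_ms:
        raise ValueError("no samples")
    if len(samples_ms) == 1:
        only = float(samples_ms[0])
        return {"p50": only, "p90": only, "p99": only}
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}
```

Whatever definition is chosen, it should be recorded with the run metadata
so that C2 error bars are comparable between trials.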

### 5.3 LMetric vs Unified hybrid

Goal: determine whether current Unified has any real gain over LMetric.

Arms:

- Pure corrected LMetric.
- Current Unified hybrid.

Run:

- 3-5 paired trials on the same trace.
- No Mooncake/PUSH.
- Same launch flags.

Additional logging:

- Route reason: `lmetric`, `high_cache_affinity`, `overload_escape`,
  `tie_break`.
- Chosen instance load, cache hit, effective new tokens.

Decision rule:

- If gain is within noise, do not oversell Unified as a performance win.
  Keep it as a policy cleanup / safety improvement.
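
One way to make "within noise" concrete is a bootstrap confidence interval
on paired differences. The sketch below is an assumption about how the
decision rule could be operationalized, not an agreed methodology; with
only 3-5 trial-level pairs the interval is coarse, so it is better applied
to per-request pairs matched by `(session_id, turn_id)`.

```python
import random
import statistics


def paired_gain_ci(baseline_ms: list[float], candidate_ms: list[float],
                   n_boot: int = 2000, alpha: float = 0.05,
                   seed: int = 0) -> tuple[float, float, float]:
    """Mean paired gain (baseline - candidate) with a bootstrap CI."""
    if len(baseline_ms) != len(candidate_ms) or not baseline_ms:
        raise ValueError("need equally sized, non-empty paired samples")
    diffs = [b - c for b, c in zip(baseline_ms, candidate_ms)]
    rng = random.Random(seed)  # fixed seed keeps the report reproducible
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_boot)
    )
    lo_idx = int(n_boot * alpha / 2)
    hi_idx = n_boot - 1 - lo_idx
    return statistics.fmean(diffs), means[lo_idx], means[hi_idx]


def gain_is_beyond_noise(baseline_ms: list[float],
                         candidate_ms: list[float]) -> bool:
    """True only if the whole CI sits above zero (candidate is faster)."""
    _, lo, _ = paired_gain_ci(baseline_ms, candidate_ms)
    return lo > 0.0
```

If `gain_is_beyond_noise` is false for the headline metric, the decision
rule above applies and Unified stays a policy cleanup.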

### 5.4 Interference and scheduler experiments

Goal: test whether scheduling is the right actuator after routing saturates.

Arms:

- Different chunked prefill sizes.
- Decode-priority / prefill throttling if available.
- High-concurrency but session-sequential trace.

Metrics:

- TPOT under concurrent heavy prefills.
- TTFT for heavy turns.
- Decode queue delay.
- GPU util timeline.

Expected value:

- If migration is too expensive, reducing prefill interference in-place is
  the most plausible next improvement.

### 5.5 PPD/KVC-style research validation

Goal: separate PPD x=1 from failed x=0/full PD and failed transfer-based
migration.

Arms:

- PD-colo cache-aware.
- x=0 PD-disagg.
- x=1 append-prefill-on-D if implementation is stable.
- Dynamic x if available.

Guardrails:

- Do not use old high-concurrency KVC numbers without the loadgen caveat.
- Do not compare partial successful subsets without goodput.
- Treat SGLang implementation bugs as system results, not hidden noise.

## 6. Task Breakdown

### Track 1: Documentation alignment

Owner task:

- Update `REPORT.md`, `docs/migration-policy-design.md`, and
  `analysis/research_findings.md` so they use the taxonomy in section 2.

Concrete edits:

- Mark single-argmin/PUSH Unified as historical.
- State that current Unified is hybrid LMetric plus high-cache affinity.
- Add mapping to PPD taxonomy: Replica, x=0 PD, x=1 append-prefill.
- Add loadgen validity checklist.

Done when:

- A reviewer can no longer confuse vLLM PUSH migration with PPD x=1.
- LMetric baseline and Unified hybrid are described as separate policies.

### Track 2: Current routing cleanup

Owner task:

- Make current Unified hybrid auditable and minimal.

Concrete edits:

- Remove stale unreachable PUSH code from `scripts/cache_aware_proxy.py`.
- Keep pure `--policy lmetric` untouched.
- Add route-decision fields for Unified hybrid.
- Add tests:
  - pure LMetric remains pure;
  - high-cache affinity triggers only under its intended gate;
  - overload escape works;
  - empty-batch tie-break does not collapse to instance 0.

Done when:

- `pytest tests/test_proxy_pick.py` covers LMetric and Unified separately.
- Bench logs can count how often Unified did something beyond LMetric.
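
The second "done when" item amounts to a small log analysis. The sketch
below assumes the proxy writes one JSON object per request with a
`route_reason` field holding one of the reasons from section 5.3; the JSONL
format and the field name are assumptions, since the route-decision fields
have not been added yet.

```python
import json
from collections import Counter
from typing import Iterable


def count_route_reasons(lines: Iterable[str]) -> Counter:
    """Count route reasons from JSONL log lines, skipping blank lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        counts[json.loads(line).get("route_reason", "unknown")] += 1
    return counts


def beyond_lmetric_fraction(counts: Counter) -> float:
    """Share of requests where Unified did something other than plain LMetric."""
    total = sum(counts.values())
    return 0.0 if total == 0 else 1.0 - counts["lmetric"] / total
```

A fraction near zero would support framing Unified as a small refinement
rather than a distinct policy.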

### Track 3: PD-sep paper matrix

Owner task:

- Finish the `analysis/pd_sep_paper_section` missing claims.

Concrete work:

- Run `bench_pd_matrix.sh` on dash0.
- Collect `metrics.summary.json`, `breakdown.json`, `apc.txt`,
  `gpu_util.csv`, and per-instance KV logs.
- Add plotters for C2/C3/C4/C5.
- Replace legacy C7 numbers with matrix outputs.

Done when:

- The PD-sep negative result no longer relies on old `--enforce-eager`
  methodology or single snapshots.

### Track 4: Benchmark substrate validation

Owner task:

- Audit the vLLM replayer and any dash0 loadgen scripts for session
  sequentiality and arrival semantics.

Concrete checks:

- Verify no session has more than one in-flight turn unless explicitly
  configured as a stress test.
- Add an analyzer that reports max concurrent turns per session.
- Report sampled session-start distribution.
- Add goodput and error-rate comparisons to all summary scripts.

Done when:

- We can label each experiment as online-realistic, burst stress, or
  synthetic microbench.
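
The "max concurrent turns per session" analyzer can be a sweep over start
and end events. This is a minimal sketch that assumes each request record
carries `session_id`, `start`, and `end` timestamps on a common clock;
those field names are hypothetical and should be mapped to whatever the
replayer actually logs.

```python
from collections import defaultdict


def max_concurrent_turns(records: list[dict]) -> dict[str, int]:
    """Peak number of simultaneously in-flight turns for each session."""
    events: dict[str, list[tuple[float, int]]] = defaultdict(list)
    for r in records:
        events[r["session_id"]].append((r["start"], +1))
        events[r["session_id"]].append((r["end"], -1))
    peaks = {}
    for session_id, evs in events.items():
        # At equal timestamps process ends (-1) before starts (+1), so a
        # turn issued exactly when the previous one finished is not overlap.
        evs.sort(key=lambda e: (e[0], e[1]))
        current = peak = 0
        for _, delta in evs:
            current += delta
            peak = max(peak, current)
        peaks[session_id] = peak
    return peaks


def sequentiality_violations(records: list[dict]) -> list[str]:
    """Sessions that had more than one turn in flight at some point."""
    return sorted(s for s, p in max_concurrent_turns(records).items() if p > 1)
```

An empty `sequentiality_violations` result is the condition for labelling a
run online-realistic; anything else should be labelled as a stress test.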

### Track 5: Scheduler/interference path

Owner task:

- Test whether in-place scheduling beats transfer-based migration.

Concrete experiments:

- Chunk size sweep.
- Decode-priority or prefill-throttle sweep.
- 16+ session sequential replay.

Done when:

- We know whether the next performance lever is scheduler policy or routing
  policy.

### Track 6: PPD-style appendix / related design

Owner task:

- Extract the useful `agentic-pd-hybrid` lessons without importing invalid
  claims.

Concrete work:

- Summarize:
  - loadgen bug and retractions;
  - PD-colo as stable baseline;
  - x=0 PD-disagg failure;
  - x=1/append-prefill motivation;
  - dynamic threshold lessons.
- Decide whether this is mainline future work or an appendix framing.

Done when:

- The story can cite PPD-style append-prefill as a distinct future actuator,
  not as evidence that the current Unified migration already works.

## 7. Proposed One-Sentence Story

Agentic serving breaks the classic PD-disaggregation intuition: long-lived
sessions make KV locality dominant, while long contexts make decode-side KV
capacity and transfer costs dominate the gains from isolating prefill; the
robust design is cache-aware PD-colocation with carefully limited session
affinity, and future disaggregation must be dynamic and session-resident
rather than static or transfer-heavy.

## 8. Open Decisions For Review

1. Do we want the main paper contribution to be the negative result
   "static PD separation fails for agentic", or the positive system
   "cache-aware PD-colo / Unified hybrid"?

2. Is PPD-style x=1 append-prefill a future-work section, or do we need to
   implement a minimal stable version before finalizing the story?

3. Should current Unified be presented as a named system if its measured
   improvement over LMetric is small, or should it be framed as an audit of
   why LMetric/cache-aware is already strong?

4. Which trace is the canonical trace for claims: the vLLM trace in
   `agentic-kv`, the GLM-5.1 trace in `agentic-pd-hybrid`, or both with
   explicit regime labels?

5. What is the target venue-style claim: systems negative result,
   workload characterization, or routing/scheduling algorithm?