diff --git a/analysis/agentic_pd_unified_story_plan.md b/analysis/agentic_pd_unified_story_plan.md
new file mode 100644
index 0000000..68c1328
--- /dev/null
+++ b/analysis/agentic_pd_unified_story_plan.md
@@ -0,0 +1,552 @@
+# Agentic PD / Unified Routing Story Plan
+
+Status: draft for review
+Date: 2026-05-25
+
+## 0. Goal
+
+This document aligns three threads:
+
+1. `agentic-kv`: vLLM-based PD-colocation, full PD separation, LMetric,
+   Unified routing, and elastic migration experiments.
+2. `dash0:/home/admin/cpfs/wjh/agentic-kv`: run artifacts and the
+   latest PD-separation paper-section scaffold.
+3. `~/phd/agentic-pd-hybrid`: SGLang/PPD/KVC experiments, including
+   retractions and stricter framing around loadgen validity.
+
+The purpose is to converge on a defensible story and a concrete task plan,
+not to force the old Unified routing hypothesis to be true.
+
+## 1. Current Best Framing
+
+### 1.1 Workload premise
+
+Agentic serving is not chatbot serving.
+
+- Requests have long input contexts and short outputs.
+- Most reusable KV is intra-session, not cross-session.
+- Sessions are multi-turn and causally sequential: turn N+1 cannot be
+  faithfully issued before turn N finishes.
+- Long-lived sessions create two competing needs:
+  - keep cache locality for future turns;
+  - avoid pinning all future work to an overloaded worker.
+
+This workload makes cache locality a first-order system objective, but
+also makes naive session pinning dangerous.
+
+### 1.2 System premise
+
+PD separation is not a universally good abstraction. It helps only when:
+
+```
+saved decode interference > KV transfer + P queue + D queue + KV capacity cost
+```
+
+For agentic workloads, that inequality often fails because long-context KV
+is large and decode-side KV residency becomes the limiting resource.
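+
+A minimal sketch of that break-even test, assuming every term can be
+estimated per request in the same unit (milliseconds here). The class and
+function names are hypothetical and not taken from either repo, and pricing
+KV capacity pressure as a latency term is an assumption made only so the
+terms can be added.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class PdSepCosts:
+    """Per-request estimates in milliseconds (all inputs are hypothetical)."""
+    saved_decode_interference_ms: float
+    kv_transfer_ms: float
+    p_queue_ms: float
+    d_queue_ms: float
+    kv_capacity_cost_ms: float  # assumption: capacity pressure priced as latency
+
+
+def pd_separation_helps(c: PdSepCosts) -> bool:
+    """True only if the saved interference exceeds the summed overheads."""
+    overhead_ms = (c.kv_transfer_ms + c.p_queue_ms
+                   + c.d_queue_ms + c.kv_capacity_cost_ms)
+    return c.saved_decode_interference_ms > overhead_ms
+```
+
+The comparison is strict, matching the `>` in the inequality above, so a tie
+counts as not helping.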
+
+### 1.3 Main thesis candidate
+
+Static PD separation is the wrong default for single-node agentic serving.
+The stronger baseline is PD-colocation with cache-aware routing. The
+interesting open problem is not "separate prefill and decode everywhere",
+but:
+
+> how to preserve session-level KV locality while retaining enough routing
+> freedom to avoid hot-worker queueing and decode interference.
+
+Unified routing should be framed as an attempt at that problem. The current
+experiments show that the migration actuator was too expensive, so the
+story should distinguish the principle from the failed mechanism.
+
+## 2. What We Should Align Across Repos
+
+### 2.1 Naming / architecture mapping
+
+Use one taxonomy consistently:
+
+| Name in paper/story | vLLM repo term | SGLang/KVC repo term | Meaning |
+|---|---|---|---|
+| Replica / PD-colo | combined / PD-colocated | `pd_colo`, SGLang `cache_aware` | all workers do prefill + decode |
+| x=0 PD-disagg | full PD separation | `pd_disagg` | every turn goes P then D |
+| x=1 / append-prefill-on-D | not implemented as such in vLLM experiments | KVC / PPD-style direct-to-D | turn 1 seeds D; later turns prefill locally on D |
+| Elastic migration | Unified PUSH / elastic offload | smart migration / re-pin sessions | move a session or a request away from overloaded worker |
+| Hybrid routing | current Unified baseline | PD-colo + soft pin / kv-aware | cache-aware LB plus explicit affinity only when worth it |
+
+Important distinction: vLLM Unified PUSH migration is not the same as PPD
+x=1. Unified PUSH still pays cross-instance KV movement for migrated
+requests. PPD x=1 tries to avoid P-to-D transfer on later turns by doing
+append-prefill directly on the resident D node.
+
+### 2.2 Results that look stable
+
+These are safe to build around:
+
+1. Full/static PD separation is weak for agentic workloads on a single node.
+   - vLLM evidence: decode-side KV memory wall and transfer overhead.
+   - SGLang evidence: x=0 PD-disagg is consistently worse than PD-colo.
+
+2. LMetric/cache-aware routing is a strong baseline.
+   - In vLLM, corrected LMetric nearly matches session-sticky because
+     `new_tokens = input - cached` creates implicit soft affinity.
+   - In SGLang, `cache_aware` is the production-stable baseline and often
+     wins or ties.
+
+3. Explicit session pinning is not automatically good.
+   - It can improve cache hit rate.
+   - It can create head-of-line blocking if sessions grow unevenly.
+   - Initial placement matters; mid-session migration is desirable in
+     principle but hard in practice.
+
+4. Transfer-based migration is currently too expensive.
+   - vLLM experiments: forced/relaxed migration worsened E2E tail.
+   - The Mooncake path shows too little overlap/layerwise benefit in the
+     current setup.
+
+5. Loadgen validity must be treated as substrate, not detail.
+   - `agentic-pd-hybrid` explicitly retracted high-concurrency claims
+     where session sequentiality was violated.
+   - Future experiments must enforce per-session causality.
+
+### 2.3 Results that need more careful wording
+
+1. "Unified beats LMetric" should not be stated as a strong result yet.
+   The latest stable implementation is closer to LMetric plus a high-cache
+   affinity gate. Expected gain is small by design.
+
+2. "PD separation is always bad" is too broad.
+   The correct claim is conditional: static/full PD separation is net
+   negative in the long-context, high-KV-footprint, single-node agentic
+   regime we measured.
+
+3. "KVC/PPD wins" is not established for our stack.
+   The SGLang repo contains useful PPD framing, but also several retractions:
+   high-concurrency wins were affected by loadgen issues, and KVC stability
+   was not production-ready in some runs.
+
+4. "Session migration will fix load balance" is still a hypothesis.
+   It is valid as a first-principles goal, but the tested vLLM actuator
+   did not satisfy the cost budget.
+
+## 3. Proposed Storytelling Outline
+
+### Section A: Why agentic serving is different
+
+Claim:
+
+- Agentic workloads combine long contexts, high intra-session reuse, and
+  sequential multi-turn sessions.
+- This makes KV cache lifecycle and routing more important than the classic
+  prefill/decode kernel dichotomy.
+
+Evidence to use:
+
+- Input/output token CDF.
+- KV reuse decomposition: intra-session vs cross-session.
+- Session length / context growth examples.
+- Per-session sequentiality requirement.
+
+Needed cleanup:
+
+- Use one trace definition and report sampling method.
+- Explicitly state whether a trace is online-realistic, synthetic burst,
+  or stress test.
+
+### Section B: Why static PD separation fails
+
+Claim:
+
+- The classic roofline premise is true but insufficient.
+- Prefill can be compute-bound while static PD separation still loses at
+  the system level.
+
+Mechanism:
+
+- PD separation relocates prefill; it does not reduce total prefill work.
+- It adds KV transfer.
+- It concentrates decode KV residency onto fewer D GPUs.
+- Long-context agentic requests hit a decode-side KV memory wall.
+
+Evidence to use:
+
+- `analysis/pd_sep_paper_section/system_analysis.md`
+- C1 workload figures.
+- C6 roofline figure.
+- KV memory wall model.
+- Fresh PD matrix, once it has been rerun without forced eager mode
+  (`--enforce-eager`).
+
+Task implication:
+
+- Complete C2/C3/C4/C5 matrix before making this a paper-grade section.
+
+### Section C: Why cache-aware PD-colo is hard to beat
+
+Claim:
+
+- Cache-aware routing already captures much of the desired session affinity.
+- LMetric's cache-adjusted prefill cost gives implicit soft affinity without
+  hard pinning.
+
+Mechanism:
+
+- A worker with cached prefix has lower `new_tokens`.
+- This naturally attracts later turns unless the worker is sufficiently
+  loaded.
+- This is exactly the balance we want: preserve locality while retaining
+  routing freedom.
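+
+The mechanism above can be sketched as a cost-based pick. This is an
+illustration, not the policy in `scripts/cache_aware_proxy.py`: only
+`new_tokens = input - cached` comes from this plan, while `WorkerState`,
+`queued_tokens`, and the additive cost are assumptions.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class WorkerState:
+    name: str
+    cached_tokens: int  # prefix tokens of this request already cached here
+    queued_tokens: int  # hypothetical load signal: tokens waiting on the worker
+
+
+def new_tokens(input_tokens: int, cached_tokens: int) -> int:
+    """Tokens that still need prefill: input - cached, floored at zero."""
+    return max(input_tokens - cached_tokens, 0)
+
+
+def pick_worker(input_tokens: int, workers: list[WorkerState]) -> WorkerState:
+    """Choose the worker with the lowest cache-adjusted cost (illustrative)."""
+    def cost(w: WorkerState) -> int:
+        return new_tokens(input_tokens, w.cached_tokens) + w.queued_tokens
+    return min(workers, key=cost)
+```
+
+For a 10,000-token request, a worker holding 9,000 cached tokens wins until
+its queued work outweighs that 9,000-token prefill saving; after that an idle
+cold worker is chosen. Python's `min` breaks ties by list order, so a real
+policy would still need an explicit tie-break.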
+
+Evidence to use:
+
+- Corrected LMetric vs Linear comparison.
+- APC distribution.
+- PD-colo stability from SGLang/KVC repo.
+
+Task implication:
+
+- Treat LMetric/cache-aware PD-colo as the primary baseline, not round-robin
+  or naive sticky.
+
+### Section D: Why Unified migration did not improve over LMetric
+
+Claim:
+
+- Unified's principle is sound as a goal, but the tested migration mechanism
+  failed the cost budget.
+
+Mechanism:
+
+- At conservative gates, too few requests migrate to change load balance.
+- At relaxed gates, migration overhead dominates.
+- Cold/heavy requests often cannot benefit from source cache and remain
+  colocated.
+- Cached migration still pays P-side queue, KV movement, and D admission.
+- The cost model initially underestimated cache-attraction feedback and
+  queue effects.
+
+Evidence to use:
+
+- Git history: single argmin -> soft affinity -> decode load/hard gate ->
+  forced migration -> revert -> hybrid LMetric.
+- Approach B / relaxed gate regressions.
+- 16-session contention: interference exists, but elastic RDMA made TPOT
+  worse and offloaded too few requests.
+
+Task implication:
+
+- Do not revive three-way argmin or aggressive PUSH migration.
+- Frame current Unified as hybrid LMetric plus selective affinity.
+
+### Section E: What remains promising
+
+There are two different future paths. They should not be conflated.
+
+Path 1: Conservative, vLLM-ready.
+
+- Stay PD-colocated.
+- Use corrected LMetric as base.
+- Add only explicit high-cache affinity / tie-break logic where it improves
+  stability.
+- Improve scheduling: adaptive chunked prefill, decode-priority controls,
+  better observability of queue and cache state.
+
+Path 2: Research, PPD-style.
+
+- Turn 1 seeds session on D.
+- Later turns do append-prefill on resident D, avoiding P-to-D transfer.
+- Dynamic x chooses P vs D based on append size, P queue, D load, and SLO.
+- Requires stable implementation and strict loadgen validation.
+
+The paper/story can say: transfer-based migration did not work;
+append-prefill-on-resident-D remains a different and potentially better
+actuator.
+
+## 4. Design Direction Recommendation
+
+### 4.1 Near-term path
+
+Use PD-colo cache-aware as the production baseline and paper baseline.
+
+Implement/validate only low-risk routing improvements:
+
+1. Pure LMetric baseline must stay separate and reproducible.
+2. Unified hybrid should be LMetric plus:
+   - high-cache explicit affinity;
+   - overload escape;
+   - deterministic non-degenerate tie-break;
+   - route-decision logging.
+3. No Mooncake/PUSH migration on the critical comparison path.
+
+This gives a clean statement:
+
+> The best robust single-node policy we have is cache-aware PD-colocation.
+> Unified hybrid is a small refinement, not a new disaggregation win.
+
+### 4.2 Research path
+
+If we want a stronger contribution beyond "PD-sep loses", the promising
+research direction is:
+
+> session-resident append-prefill with dynamic P/D selection.
+
+This aligns better with PPD than vLLM PUSH migration does.
+
+Key design principle:
+
+- Do not move KV just to run prefill elsewhere unless the future benefit is
+  large enough to amortize the transfer.
+- Prefer using the worker that already owns the session KV, unless decode
+  load or append size makes that choice violate SLO.
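+
+One way to read the two principles together, as a hedged sketch: stay on the
+resident worker while it can meet the SLO, and move only when the transfer
+is amortized over the turns still expected. All names and inputs below are
+hypothetical; in particular, the estimates of resident TTFT, per-turn saving,
+and remaining turns are assumed to come from elsewhere.
+
+```python
+def transfer_is_amortized(transfer_s: float, saving_per_turn_s: float,
+                          expected_remaining_turns: int) -> bool:
+    """True if the total expected future saving exceeds the one-off transfer."""
+    return saving_per_turn_s * expected_remaining_turns > transfer_s
+
+
+def choose_placement(est_resident_ttft_s: float, ttft_slo_s: float,
+                     transfer_s: float, saving_per_turn_s: float,
+                     expected_remaining_turns: int) -> str:
+    """Return "resident" or "migrate" for the next turn of a session."""
+    # Prefer the worker that already owns the session KV if it meets the SLO.
+    if est_resident_ttft_s <= ttft_slo_s:
+        return "resident"
+    # SLO at risk: move only if the transfer pays for itself over future turns.
+    if transfer_is_amortized(transfer_s, saving_per_turn_s,
+                             expected_remaining_turns):
+        return "migrate"
+    return "resident"
+```
+
+Under this reading, a session near its end stays put even when the resident
+worker is slow, because too few turns remain to repay the transfer.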
+
+## 5. Experiment Plan
+
+### 5.1 Must-have validity checks
+
+For every benchmark:
+
+- Per-session sequentiality enforced.
+- Attempted/completed/error counts reported.
+- Pair by `(session_id, turn_id)` when comparing arms.
+- Report goodput, not only the latency of successful requests.
+- Record git commit, launch flags, trace path, request limit, time scale,
+  session sampling method, and hardware.
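+
+A small sketch of the pairing and goodput checks, under stated assumptions:
+the `Result` record and its field names are hypothetical, and goodput is
+taken here as successful completions per second (an SLO-filtered definition
+would only change the `ok` flag).
+
+```python
+from typing import NamedTuple, Optional
+
+
+class Result(NamedTuple):
+    session_id: str
+    turn_id: int
+    ok: bool
+    e2e_s: Optional[float]  # None when the request failed
+
+
+def goodput(results: list[Result], duration_s: float) -> float:
+    """Successful requests per second over the whole run."""
+    return sum(r.ok for r in results) / duration_s
+
+
+def paired_e2e_deltas(arm_a: list[Result],
+                      arm_b: list[Result]) -> dict[tuple[str, int], float]:
+    """E2E(b) - E2E(a) for turns that succeeded in both arms."""
+    a = {(r.session_id, r.turn_id): r for r in arm_a if r.ok}
+    b = {(r.session_id, r.turn_id): r for r in arm_b if r.ok}
+    return {key: b[key].e2e_s - a[key].e2e_s for key in a.keys() & b.keys()}
+```
+
+Turns that failed in either arm drop out of the paired deltas, which is why
+goodput has to be reported next to them.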
+
+### 5.2 PD separation matrix
+
+Goal: make the static PD-sep negative result paper-grade.
+
+Arms:
+
+- PD-colo cache-aware.
+- PD-sep 4P+4D.
+- PD-sep 6P+2D.
+- Optional: round-robin baseline only as a sanity check, not as a main
+  comparison.
+- Optional: eager vs CUDA-graph ablation.
+
+Metrics:
+
+- TTFT/E2E/TPOT p50/p90/p99.
+- Goodput and error rate.
+- APC mean and per-instance distribution.
+- GPU util and decode-side KV occupancy time series.
+- TTFT breakdown: prefill, KV transfer, D wait.
+
+Output:
+
+- C2 headline bar with error bars.
+- C3 KV utilization time series.
+- C4 TTFT stacked breakdown.
+- C5 CUDA-graph ablation.
+
+### 5.3 LMetric vs Unified hybrid
+
+Goal: determine whether current Unified has any real gain over LMetric.
+
+Arms:
+
+- Pure corrected LMetric.
+- Current Unified hybrid.
+
+Run:
+
+- 3-5 paired trials on the same trace.
+- No Mooncake/PUSH.
+- Same launch flags.
+
+Additional logging:
+
+- Route reason: `lmetric`, `high_cache_affinity`, `overload_escape`,
+  `tie_break`.
+- Chosen instance load, cache hit, effective new tokens.
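+
+A sketch of how such logs could be tallied, assuming one JSON object per
+line with a `route_reason` field. That log format and the helper names are
+assumptions, not the proxy's current output; only the four reason labels
+come from the list above.
+
+```python
+import json
+from collections import Counter
+
+ROUTE_REASONS = ("lmetric", "high_cache_affinity", "overload_escape",
+                 "tie_break")
+
+
+def count_route_reasons(jsonl_lines) -> Counter:
+    """Count route reasons from an iterable of JSON lines."""
+    counts = Counter({reason: 0 for reason in ROUTE_REASONS})
+    for line in jsonl_lines:
+        line = line.strip()
+        if not line:
+            continue  # tolerate blank lines in the log
+        counts[json.loads(line)["route_reason"]] += 1
+    return counts
+
+
+def beyond_lmetric_fraction(counts: Counter) -> float:
+    """Share of decisions where Unified did something other than LMetric."""
+    total = sum(counts.values())
+    return 0.0 if total == 0 else (total - counts["lmetric"]) / total
+```
+
+A fraction near zero would indicate that the Unified arm is effectively
+pure LMetric on that trace.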
+
+Decision rule:
+
+- If gain is within noise, do not oversell Unified as a performance win.
+  Keep it as a policy cleanup / safety improvement.
+
+### 5.4 Interference and scheduler experiments
+
+Goal: test whether scheduling is the right actuator after routing saturates.
+
+Arms:
+
+- Different chunked prefill sizes.
+- Decode-priority / prefill throttling if available.
+- High-concurrency but session-sequential trace.
+
+Metrics:
+
+- TPOT under concurrent heavy prefills.
+- TTFT for heavy turns.
+- Decode queue delay.
+- GPU util timeline.
+
+Expected value:
+
+- If migration is too expensive, reducing prefill interference in place is
+  the most plausible next improvement.
+
+### 5.5 PPD/KVC-style research validation
+
+Goal: separate PPD x=1 from failed x=0/full PD and failed transfer-based
+migration.
+
+Arms:
+
+- PD-colo cache-aware.
+- x=0 PD-disagg.
+- x=1 append-prefill-on-D if implementation is stable.
+- Dynamic x if available.
+
+Guardrails:
+
+- Do not use old high-concurrency KVC numbers without the loadgen caveat.
+- Do not compare partial successful subsets without goodput.
+- Treat SGLang implementation bugs as system results, not hidden noise.
+
+## 6. Task Breakdown
+
+### Track 1: Documentation alignment
+
+Owner task:
+
+- Update `REPORT.md`, `docs/migration-policy-design.md`, and
+  `analysis/research_findings.md` so they use the taxonomy in Section 2.1.
+
+Concrete edits:
+
+- Mark single-argmin/PUSH Unified as historical.
+- State that current Unified is hybrid LMetric plus high-cache affinity.
+- Add mapping to PPD taxonomy: Replica, x=0 PD-disagg, x=1
+  append-prefill-on-D.
+- Add loadgen validity checklist.
+
+Done when:
+
+- A reviewer can no longer confuse vLLM PUSH migration with PPD x=1.
+- LMetric baseline and Unified hybrid are described as separate policies.
+
+### Track 2: Current routing cleanup
+
+Owner task:
+
+- Make current Unified hybrid auditable and minimal.
+
+Concrete edits:
+
+- Remove stale unreachable PUSH code from `scripts/cache_aware_proxy.py`.
+- Keep pure `--policy lmetric` untouched.
+- Add route-decision fields for Unified hybrid.
+- Add tests:
+  - pure LMetric remains pure;
+  - high-cache affinity triggers only under its intended gate;
+  - overload escape works;
+  - empty-batch tie-break does not collapse to instance 0.
+
+Done when:
+
+- `pytest tests/test_proxy_pick.py` covers LMetric and Unified separately.
+- Bench logs can count how often Unified did something beyond LMetric.
+
+### Track 3: PD-sep paper matrix
+
+Owner task:
+
+- Finish the `analysis/pd_sep_paper_section` missing claims.
+
+Concrete work:
+
+- Run `bench_pd_matrix.sh` on dash0.
+- Collect `metrics.summary.json`, `breakdown.json`, `apc.txt`,
+  `gpu_util.csv`, and per-instance KV logs.
+- Add plotters for C2/C3/C4/C5.
+- Replace legacy C7 numbers with matrix outputs.
+
+Done when:
+
+- The PD-sep negative result no longer relies on old `--enforce-eager`
+  methodology or single snapshots.
+
+### Track 4: Benchmark substrate validation
+
+Owner task:
+
+- Audit the vLLM replayer and any dash0 loadgen scripts for session
+  sequentiality and arrival semantics.
+
+Concrete checks:
+
+- Verify no session has more than one in-flight turn unless explicitly
+  configured as a stress test.
+- Add an analyzer that reports max concurrent turns per session.
+- Report sampled session-start distribution.
+- Add goodput and error-rate comparisons to all summary scripts.
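+
+The analyzer in the second check could look like the sketch below. It
+assumes each turn is available as a `(session_id, start_s, end_s)` tuple;
+that input shape and the function name are hypothetical, not the replayer's
+actual log schema.
+
+```python
+from collections import defaultdict
+
+
+def max_concurrent_turns(turns) -> dict:
+    """Map each session_id to its peak number of in-flight turns."""
+    events = defaultdict(list)
+    for session_id, start_s, end_s in turns:
+        events[session_id].append((start_s, 1))   # turn becomes in-flight
+        events[session_id].append((end_s, -1))    # turn completes
+    peaks = {}
+    for session_id, session_events in events.items():
+        # At equal timestamps, process ends (-1) before starts (+1) so that
+        # back-to-back turns are not counted as overlapping.
+        session_events.sort(key=lambda e: (e[0], e[1]))
+        live = peak = 0
+        for _, delta in session_events:
+            live += delta
+            peak = max(peak, live)
+        peaks[session_id] = peak
+    return peaks
+```
+
+Any session with a peak above 1 violates per-session sequentiality unless
+the run is explicitly labeled as a stress test.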
+
+Done when:
+
+- We can label each experiment as online-realistic, burst stress, or
+  synthetic microbench.
+
+### Track 5: Scheduler/interference path
+
+Owner task:
+
+- Test whether in-place scheduling beats transfer-based migration.
+
+Concrete experiments:
+
+- Chunk size sweep.
+- Decode-priority or prefill-throttle sweep.
+- 16+ session sequential replay.
+
+Done when:
+
+- We know whether the next performance lever is scheduler policy or routing
+  policy.
+
+### Track 6: PPD-style appendix / related design
+
+Owner task:
+
+- Extract the useful `agentic-pd-hybrid` lessons without importing invalid
+  claims.
+
+Concrete work:
+
+- Summarize:
+  - loadgen bug and retractions;
+  - PD-colo as stable baseline;
+  - x=0 PD-disagg failure;
+  - x=1/append-prefill motivation;
+  - dynamic threshold lessons.
+- Decide whether this is mainline future work or an appendix framing.
+
+Done when:
+
+- The story can cite PPD-style append-prefill as a distinct future actuator,
+  not as evidence that the current Unified migration already works.
+
+## 7. Proposed One-Sentence Story
+
+Agentic serving breaks the classic PD-disaggregation intuition: long-lived
+sessions make KV locality a first-order objective, while long contexts make
+decode-side KV capacity and transfer costs outweigh the gains from isolating
+prefill; the robust design is cache-aware PD-colocation with carefully
+limited session affinity, and future disaggregation must be dynamic and
+session-resident rather than static or transfer-heavy.
+
+## 8. Open Decisions For Review
+
+1. Do we want the main paper contribution to be the negative result
+   "static PD separation fails for agentic serving", or the positive system
+   "cache-aware PD-colo / Unified hybrid"?
+
+2. Is PPD-style x=1 append-prefill a future-work section, or do we need to
+   implement a minimal stable version before finalizing the story?
+
+3. Should current Unified be presented as a named system if its measured
+   improvement over LMetric is small, or should it be framed as an audit of
+   why LMetric/cache-aware is already strong?
+
+4. Which trace is the canonical trace for claims: the vLLM trace in
+   `agentic-kv`, the GLM-5.1 trace in `agentic-pd-hybrid`, or both with
+   explicit regime labels?
+
+5. What is the target venue-style claim: a systems negative result, a
+   workload characterization, or a routing/scheduling algorithm?