diff --git a/analysis/pd_separation_analysis.md b/analysis/pd_separation_analysis.md
index decaa38..8f7c01b 100644
--- a/analysis/pd_separation_analysis.md
+++ b/analysis/pd_separation_analysis.md
@@ -1,289 +1,236 @@
-# A Systematic Analysis of PD Separation under Agentic Workloads
+# PD Disaggregation for Agentic LLM Workloads: A Systematic Study
 
-## 1. Trace Characteristics (GLM-5.1 Agentic Coder, 2h, 2.1M requests)
+## TL;DR
 
-```
-Total requests:  2,114,220
-Input tokens:    71.1B (avg 33.6k/req, p50=20k, p90=88k)
-Output tokens:   940M  (avg 445/req, p50=80, p90=811)
-I/O ratio:       75.6x (aggregate), 217.8x (per-req median)
-Prefill share:   98% of total tokens
-Sessions:        1.3M (90% single-turn, 9% multi-turn)
-```
+We benchmarked PD separation (prefill-decode disaggregation) against PD co-location by replaying requests sampled from a production agentic-coder trace (GLM-5.1, 2.1M requests, avg 33.6k input tokens) against Qwen3-Coder-30B-A3B-Instruct on 8× H20. Under a fair comparison with the same cache-aware global scheduler:
 
-**Fundamental differences from traditional chatbot workloads:**
+**PD separation is net negative for single-machine agentic workloads.** The root cause is not what prior work (DistServe, Splitwise) targeted — it is a **KV cache memory wall** on decode instances.
 
-| Feature | Traditional Chatbot | Agentic Coder (GLM-5.1) |
-|------|-------------------|------------------------|
-| I/O ratio | 1-10x | **75.6x** |
-| Input p50 | 500-2000 tokens | **20,030 tokens** |
-| Output p50 | 200-500 tokens | **80 tokens** |
-| Prefill token share | 50-80% | **98%** |
-| >32k input | <5% | **38%** |
-| Multi-turn | 50-80% | **9%** |
+| Config (TP=1, 8×H20) | TTFT p50 | TPOT p90 | GPU util | KV cache pressure |
+|---|---|---|---|---|
+| Combined DP=8 (cache-aware) | **0.731s** | **0.073s** | **30.5%** | Low (spread across 8 inst) |
+| PD-Sep 6P+2D (cache-aware) | 1.481s | 0.077s | 16.9% | **97.1% on decode** |
 
-**KV cache reuse characteristics:**
+Per-request breakdown shows **87.7% of TTFT** is spent waiting for KV cache memory on decode instances, not prefill compute or KV transfer.
 
-```
-Unique hash blocks:        20,650,883
-Shared blocks (ref>1):     9,749,379 (47%)
-Highly shared (ref>10):    2,428,160
-Intra-session reuse:       57%
-Top-10 blocks ref count:   64,754 (system prompt blocks)
-Theoretical cache hit:     71% (infinite cache, first 100k requests)
-```
+---
 
-**Input length distribution and token share:**
+## 1. Workload Characterization
 
-```
-     <1k:   202,396 reqs ( 9%)       89M tokens ( 0%)
-    1-8k:   380,009 reqs (17%)     1.6B tokens ( 2%)
-   8-32k:   720,871 reqs (34%)    12.7B tokens (17%)
-  32-65k:   405,371 reqs (19%)    19.4B tokens (27%)
- 65-131k:   394,014 reqs (18%)    35.7B tokens (50%)
-   >131k:    11,559 reqs ( 0%)     1.6B tokens ( 2%)
-```
+**Trace**: GLM-5.1 Agentic Coder, production cluster, 2 hours
 
-Long-context requests of 65-131k tokens account for 50% of all input tokens.
+| Metric | Value |
+|--------|-------|
+| Requests | 2,114,220 |
+| Input tokens | 71.1B (avg 33.6k, p50=20k, p90=88k) |
+| Output tokens | 940M (avg 445, p50=80) |
+| I/O ratio | 75.6x aggregate, 217.8x per-request median |
+| Prefill token share | 98% |
+| Sessions | 1.3M (90% single-turn) |
+| >32k input | 38% of requests, 79% of tokens |
 
-## 2. Core Assumptions of DistServe and Other PD-Separation Work
+**KV cache reuse**:
 
-PD-separation systems such as DistServe (OSDI'24), Splitwise, and TetriInfer rest on the following assumptions:
+| Metric | Value |
+|--------|-------|
+| Theoretical prefix cache hit (infinite, single inst) | 71% |
+| Shared hash blocks (ref>1) | 47% of unique blocks |
+| Intra-session reuse | 57% |
+| Top blocks ref count | 64,754 (system prompt) |
+| Actual APC (Combined, cache-aware, 8 inst) | 44.7% |
+| Actual APC (Round-robin, 8 inst) | 20.8% |
 
-### Assumption A: Prefill and decode have different compute characteristics
-- **Prefill**: compute-bound, high GPU utilization, larger batches are better
-- **Decode**: memory-bandwidth-bound, low GPU utilization, latency-sensitive
+**Request profile after prefix cache**:
 
-**Validation on the agentic workload**: ✅ Holds, but needs refinement
+| Bucket | Share of requests | Avg new tokens to prefill |
+|--------|-------------------|--------------------------|
+| >90% cache hit (warm) | 22% | 1,314 |
+| 50-90% cache hit | 14% | 10,052 |
+| 1-50% cache hit | 8% | 38,909 |
+| 0% cache hit (cold) | 55% | 17,696 |
 
-Roofline analysis shows (see Section 3 for details):
+## 2. Experiment Setup
 
-```
-                    Arithmetic Intensity (FLOP/byte)
-  Decode:           1.0 - 1.9     (memory-bound, always far below the ridge point)
-  Prefill 0% reuse: 23,000-72,000 (strongly compute-bound)
-  Prefill 70% reuse: 10,000-42,000 (still compute-bound!)
-  Prefill 95% reuse: 1,900-10,800  (still compute-bound!)
-  Ridge point (H20): 37
-```
+**Hardware**: 8× NVIDIA H20 (96GB HBM, NVLink, 4× ConnectX-7 200Gbps RDMA)
 
-**Even at 95% KV cache reuse, prefill is still compute-bound.** The absolute amount of compute, however, drops sharply.
+**Software**: vLLM 0.18.1 (source in `third_party/vllm/`, patched scheduler assert), Mooncake 0.3.10 (RDMA KV transfer), uv-managed Python venv
 
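-### Assumption B: PD co-location causes mutual interference
-- Large-batch prefill compute monopolizes GPU resources and raises decode TPOT
-- Continuous small decode steps occupy GPU scheduling slots and reduce prefill throughput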
+**Model**: Qwen3-Coder-30B-A3B-Instruct (MoE 128E top-8, 3B active params)
 
-**Validation on the agentic workload**: ⚠️ Interference exists, but **cache-aware routing can eliminate it**
+**Configurations tested** (all use the same cache-aware + token-level LB global scheduler unless noted):
 
-```
-Same cache-aware scheduler, TP=1, 8 GPUs:
-  Combined TP=1 DP=8:  TPOT p90 = 0.073s
-  PD-Sep TP=1 4P+4D:   TPOT p90 = 0.074s
-  → Difference <2%, not significant
-```
+| Config | Instances | GPU allocation | Scheduler |
+|--------|-----------|----------------|-----------|
+| Combined TP=8 DP=1 | 1 | 8 GPU shared | N/A (single) |
+| Combined TP=2 DP=4 | 4 independent | 2 GPU each | RR (legacy) |
+| Combined TP=1 DP=8 | 8 independent | 1 GPU each | RR / cache-aware |
+| PD-Sep TP=1 4P+4D | 4P + 4D Mooncake | 4 GPU P, 4 GPU D | cache-aware |
+| PD-Sep TP=1 6P+2D | 6P + 2D Mooncake | 6 GPU P, 2 GPU D | cache-aware |
 
-Compared with round-robin routing:
-```
-  Combined TP=1 DP=8 (RR):          TPOT p90 = 0.086s
-  Combined TP=1 DP=8 (cache-aware): TPOT p90 = 0.073s  → -15%
-  → The gain from routing exceeds the gain from PD separation
-```
+**Benchmark params**: 1000 sampled requests (200 for ablations), `--enforce-eager`, `--max-model-len 200000`
 
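-**Reason**: cache-aware routing concentrates high-cache-hit requests on specific instances,
-so the number of new tokens each instance actually prefills drops sharply (71% are cached),
-and prefill-decode interference eases naturally as the prefill workload shrinks.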
+**Trace sampler**: `scripts/sample_trace.py` — random session sampling preserving multi-turn structure + hash_ids
 
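-### Assumption C: KV cache transfer overhead is negligible
-- DistServe assumes the P→D KV transfer latency is far smaller than the prefill compute time
-- This holds over high-bandwidth interconnects such as InfiniBand/NVLink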
+**Global scheduler**: `scripts/cache_aware_proxy.py` — supports both `--combined` (PD-colo) and `--prefill/--decode` (PD-sep) modes. Score = `ongoing_tokens/avg_load - α·cache_hit_ratio`, session affinity for multi-turn.
 
-**Validation on the agentic workload**: ❌ Does not hold
+## 3. Results
 
-```
-PD-Sep TTFT p50 = 1.261s  vs  Combined TTFT p50 = 0.731s  (+72%)
-```
-
-Reasons:
-1. Agentic inputs are extremely long (p50=20k, p90=88k tokens), so the KV cache is large
-2. Per-request KV cache = 20k tokens × 48 layers × 2 (K+V) × 512 bytes ≈ 1GB
-3. More importantly, the await-prefill path adds serial latency: proxy → prefill → KV transfer → decode → first token
-
-### Assumption D: Dedicated prefill nodes improve prefill throughput
-- Prefill nodes do no decode, so GPU utilization is higher
-- Larger batch sizes become possible
-
-**Validation on the agentic workload**: ⚠️ The benefit is diluted by caching
-
-```
-Theoretical prefix cache hit (infinite cache): 71% of input tokens
-Actual APC (Combined, cache-aware, 8 inst): 44.7%
-```
-
-A 71% cache hit rate means only 29% of input tokens need actual prefill compute.
-Nominal avg input 33.6k → actual avg new prefill ~9.7k tokens.
-The GPU-utilization advantage of dedicated prefill shrinks as the prefill workload shrinks.
-
-## 3. Roofline Analysis: Compute/Memory Behavior of Prefill under High Cache Reuse
-
-### 3.1 Model Compute Structure
-
-```
-Qwen3-Coder-30B-A3B (MoE 128E top-8):
-  48 layers, hidden=2048, heads=32, kv_heads=4 (GQA), head_dim=128
-  FFN: 6144 intermediate per expert, 8 experts active per token
-  Active params per token: ~3B
-
-H20 GPU: 148 TFLOPS (BF16), 4.0 TB/s HBM → Ridge point: 37 FLOP/byte
-```
-
-### 3.2 Decode Is Always Memory-Bound
-
-```
-SeqLen    FLOP        Bytes       AI (F/B)    Bound
-1,000     3.04e+10    3.01e+10    1.0         MEMORY
-16,000    3.63e+10    3.16e+10    1.1         MEMORY
-64,000    5.52e+10    3.63e+10    1.5         MEMORY
-128,000   8.03e+10    4.26e+10    1.9         MEMORY
-```
-
-Decode AI is always < 2, far below the ridge point (37). Each decode step processes only one token,
-so compute is tiny and the bottleneck is reading the model weights and the full KV cache.
-
-### 3.3 Prefill Stays Compute-Bound Even at 95% Reuse
-
-```
-SeqLen   Reuse%  NewTok    AI (F/B)    Bound       vs Decode
-32,000      0%   32,000     23,368     COMPUTE     18,190x
-32,000     50%   16,000     14,899     COMPUTE     11,597x
-32,000     70%    9,600     10,045     COMPUTE      7,819x
-32,000     90%    3,200      3,821     COMPUTE      2,974x
-32,000     95%    1,600      1,980     COMPUTE      1,542x
-
-64,000      0%   64,000     40,758     COMPUTE     26,813x
-64,000     70%   19,200     20,610     COMPUTE     13,559x
-64,000     90%    6,400      8,544     COMPUTE      5,621x
-64,000     95%    3,200      4,549     COMPUTE      2,993x
-```
-
-### 3.4 Why High Reuse Does Not Change the Compute-Bound Nature
-
-What KV cache reuse reduces:
-- K/V projection compute (new tokens only)
-- KV writes (new tokens only)
-
-What KV cache reuse does **not** reduce:
-- **Q×K^T attention**: every new Q must attend over all seq_len KV entries
-  ```
-  FLOPs = new_tokens × seq_len × head_dim × num_heads × 2 × num_layers
-  ```
-  At 95% reuse, 32k seq: 1600 × 32000 × 128 × 32 × 2 × 48 ≈ 2×10^13
-  This quadratic term dominates total compute at long context
-
-- **MoE FFN**: every new token activates 8 experts
-  ```
-  FLOPs = new_tokens × 3 × D × D_ffn × 2 × K_experts × num_layers
-  ```
-
-**Prefill becomes memory-bound only near 100% reuse (< 10 new tokens).**
-
-### 3.5 When Does Prefill Become Memory-Bound?
-
-```
-SeqLen=32,000: at new_tokens ≈ 5-10 → AI ≈ 37 (ridge point)
-SeqLen=64,000: at new_tokens ≈ 5-10 → AI ≈ 37
-```
-
-In the actual agentic trace:
-```
-Compute-bound prefills: 961 (96%)
-Memory-bound prefills:  37 (3%)   ← extremely warm requests with near-100% reuse
-```
-
-### 3.6 Key Insight: "Compute-bound but lightweight"
-
-Prefill under high cache reuse sits in a distinctive regime:
-
-```
-  Prefill bound type:          Compute-bound (unchanged)
-  Prefill absolute work:       Sharply reduced (71% cache → only 29% of tokens computed)
-  Prefill-decode interference: Reduced as absolute work drops (no physical isolation needed)
-```
-
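-This explains why PD separation did not help:
-- PD separation addresses the problem of "prefill is too heavy and interferes with decode"
-- But cache-aware routing has already made the actual prefill workload light enough
-- The benefit of physical isolation (PD separation) is cancelled out by KV transfer overhead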
-
-## 4. Experimental Results
-
-### 4.1 Full Experiment Matrix
-
-All experiments use the same cache-aware + token-level load-balanced global scheduler.
+### 3.1 Main Comparison (unified cache-aware scheduler)
 
 | Config | OK/N | TTFT p50 | TPOT p90 | E2E p50 | APC |
 |--------|------|----------|----------|---------|-----|
-| TP=8 DP=1 (single instance) | 998/1000 | 0.467s | 0.129s | 3.30s | 53.0% |
-| TP=2 DP=4 (4 inst, RR) | 997/999 | 0.844s | 0.095s | 4.92s | 33.5% |
-| TP=1 DP=8 (8 inst, RR) | 997/999 | 1.836s | 0.086s | 6.67s | 20.8% |
-| **TP=1 DP=8 (cache-aware)** | **997/999** | **0.731s** | **0.073s** | **4.48s** | **44.7%** |
-| TP=1 PD-Sep 4P+4D (cache-aware) | 509/564 | 1.261s | 0.074s | 5.61s | 40.2% |
+| Combined TP=1 DP=8 (cache-aware) | 997/999 | **0.731s** | **0.073s** | **4.48s** | **44.7%** |
+| PD-Sep TP=1 4P+4D (cache-aware) | 509/564 | 1.261s | 0.074s | 5.61s | 40.2% |
+| Combined TP=1 DP=8 (RR) | 997/999 | 1.836s | 0.086s | 6.67s | 20.8% |
 
-### 4.2 Effect of Cache-Aware Routing
+### 3.2 GPU Utilization (200 req, time_scale=20)
+
+| Config | All GPU mean | Prefill GPU | Decode GPU | Decode KV cache |
+|--------|-------------|-------------|------------|-----------------|
+| Combined 8colo | **30.5%** (active 64%) | — | — | Distributed |
+| PD-Sep 4P+4D | 12.4% (active 24%) | 16.9% (active 17%) | 7.8% (active 30%) | ~97% |
+| PD-Sep 6P+2D | 16.9% (active 28%) | 16.2% (active 16%) | 19.0% (active 64%) | ~97% |
+
+### 3.3 Per-Request Breakdown (6P+2D, await mode)
+
+| Stage | p50 | % of TTFT |
+|-------|-----|-----------|
+| Prefill (queue + compute + KV push) | 0.108s | 12.3% |
+| Proxy overhead | 0.000s | 0.0% |
+| **KV pull + decode wait** | **109.6s** | **87.7%** |
+| Total TTFT | 110.2s | 100% |
+
+Root cause of the 109.6s `kv+decode` stage: the vLLM decode log shows `Running: 0 reqs, Waiting: 6 reqs, KV cache: 97.1%`. The decode GPU sits idle while requests queue for KV cache memory.
+
+### 3.4 Ablations
+
+| Ablation | Change | TTFT | TPOT p90 | Verdict |
+|----------|--------|------|----------|---------|
+| P/D ratio: 6P+2D vs 4P+4D | More prefill GPUs | -26% | ~same | **Helps TTFT** (less prefill queue) |
+| Fire-and-forget vs await | Async prefill dispatch | +260% | -44% | **Hurts** (decode KV cache contention) |
+
+## 4. Analysis
+
+### 4.1 DistServe's Assumptions vs Agentic Reality
+
+| Assumption | Chatbot (DistServe) | Agentic (this work) |
+|------------|-------------------|---------------------|
+| A. P is compute-bound, D is memory-bound | ✅ | ✅ Even at 95% reuse, prefill AI >1,000 FLOP/byte vs decode AI <2 |
+| B. PD co-location causes interference | ✅ | ❌ Cache-aware routing eliminates interference (TPOT p90 0.073s vs 0.074s) |
+| C. KV transfer cost negligible | ✅ (short input) | ❌ Avg 33.6k input tokens; TTFT p50 +72% on the PD-Sep path (transfer plus decode-side wait) |
+| D. Dedicated prefill improves throughput | ✅ | ❌ Up to 71% cache hit → prefill already lightweight |
+| **E. Decode KV cache not a bottleneck** | **✅ (short context)** | **❌ THE bottleneck: 97% KV cache on decode** |
+
+### 4.2 Roofline: Prefill Stays Compute-Bound Under High Cache Reuse
 
 ```
-Round-robin → Cache-aware (Combined TP=1 DP=8):
-  TTFT p50: 1.836s → 0.731s  (-60%)
-  TPOT p90: 0.086s → 0.073s  (-15%)
-  E2E  p50: 6.673s → 4.480s  (-33%)
-  APC:      20.8%  → 44.7%   (+24pp)
+SeqLen=64k, Model=Qwen3-30B-A3B MoE, GPU=H20 (ridge point=37 FLOP/byte)
+
+Reuse%   NewTokens   AI (FLOP/byte)   Bound        vs Decode
+0%       64,000      40,758           COMPUTE      26,813x
+70%      19,200      20,610           COMPUTE      13,559x
+90%       6,400       8,544           COMPUTE       5,621x
+95%       3,200       4,549           COMPUTE       2,993x
+Decode        1         1.5           MEMORY            1x
 ```
 
-The gain from cache-aware routing far exceeds the gain from PD separation.
+Even at 95% reuse, prefill AI = 4549 >> ridge point 37. Prefill remains compute-bound because Q×K^T attention costs `new_tokens × seq_len`: every new token still attends over the full context, cached or not, so at a fixed reuse ratio the cost stays quadratic in context length.
 
-### 4.3 Fixing Engineering Issues Along the Way
+But **absolute FLOPs** drop: at the theoretical 71% cache hit rate only 29% of input tokens need prefill compute (about 55% at the measured 44.7% APC). This makes P-D interference negligible without physical separation.
 
-Several PD-separation engineering issues were found and fixed during the experiments:
+### 4.3 The Real Bottleneck: Decode KV Cache Memory Wall
 
-| Issue | Root cause | Fix |
-|------|------|------|
-| Decode engine crash | vLLM scheduler assert: request already aborted when the KV transfer callback fires | Patch scheduler.py: assert → graceful skip |
-| Head-of-line blocking | Proxy load-balances by request count and ignores request size | Token-level ongoing_tokens load balancing |
-| "Timeout waiting for P side ready" | Proxy dispatches prefill fire-and-forget while decode waits blindly for KV | Await-prefill + kv_load_failure_policy=recompute |
-| Port collision on startup | 8 Mooncake instances start simultaneously and contend for the torch distributed port | Staggered startup + explicit MASTER_PORT |
-| Cache routing "rich get richer" | score = ongoing - alpha*cached concentrates traffic on a single instance | Normalized scoring: ongoing/avg_load - alpha*cache_ratio |
+PD separation concentrates all decode onto fewer GPUs:
 
-## 5. Conclusions
+| | Combined (8 inst) | PD-Sep 6P+2D |
+|---|---|---|
+| Decode KV cache total | 8 × 28GB = **224GB** | 2 × 28GB = **56GB** |
+| Concurrent decode reqs | ~1 per inst | ~4 per inst |
+| KV cache utilization | Low | **97.1%** |
 
-### 5.1 Why PD Separation Does Not Pay Off for Agentic Workloads
+At 97.1% KV cache usage, a 49-token request (KV = a few MB) waits **114 seconds** for a 64k-token request to finish decode and release its ~8GB of KV cache.
 
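-1. **Cache reuse sharply reduces absolute prefill work (71% cache hit → only 29% computed)**, so P-D interference is not significant
-2. **Prefill is still compute-bound** (AI stays >1000 even at 95% reuse), but total FLOPs per request drop sharply as new_tokens shrinks
-3. **Cache-aware routing provides "soft PD isolation"**, matching the effect of physical isolation without the KV transfer overhead
-4. **KV transfer overhead is not negligible** (TTFT +72%) and cancels out the isolation benefit
-5. **The MoE model has few active params** (3B), so per-token compute is light to begin with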
+This is **memory-capacity head-of-line blocking**: the GPU is idle (`Running: 0`), but cannot schedule new requests because KV cache is full.
 
-### 5.2 When PD Separation Is Worthwhile
+### 4.4 Why Cache-Aware Routing Matters More Than PD Separation
 
-| Condition | Chatbot (worthwhile) | Agentic (not worthwhile) |
-|------|-----------------|-----------------|
-| Cache hit rate | <10% | **71%** |
-| Model active params | 70B (dense) | **3B (MoE)** |
-| I/O ratio | 1-10x | **75.6x** |
-| Per-request prefill FLOPs | Very high | **Low (after cache)** |
-| KV transfer cost vs prefill cost | Negligible | **Significant** |
+| Change | TTFT impact | TPOT p90 impact | APC impact |
+|--------|-------------|-----------------|------------|
+| RR → cache-aware routing | **-60%** | **-15%** | **+24pp** |
+| Combined → PD-Sep | +72% | +1% | -5pp |
 
-### 5.3 How Agentic Workloads Should Be Optimized
+Cache-aware routing provides "soft PD isolation" by reducing per-instance prefill workload through better cache utilization, without the KV transfer overhead or decode memory wall of physical PD separation.
 
-1. **Cache-aware routing** (validated): schedule jointly on ongoing_tokens + prefix_cache_hit,
-   raising APC from 20.8% (RR) to 44.7% and cutting TPOT p90 by 15%
+## 5. Conclusions
 
-2. **Cross-instance KV cache sharing**: let multiple instances share a global KV pool,
-   pushing the cache hit rate closer to the theoretical 71%
+1. **Single-machine PD separation is net negative for agentic workloads** due to the decode-side KV cache memory wall
+2. **Cache-aware routing is the dominant optimization** — improves TTFT by 60%, TPOT by 15%, APC by 24pp
+3. **Prefill stays compute-bound even at 95% cache reuse**, but absolute compute drops enough to eliminate P-D interference
+4. **PD separation may help in multi-machine settings** where decode has dedicated memory pools (e.g., DRAM-backed Mooncake KV store) not limited by single-GPU HBM
 
-3. **Prefix pre-warming**: for cold-start requests (55%, 0% cache hit),
-   precompute common prefixes (system prompt blocks) and distribute them to all instances
+## 6. Patches Applied to vLLM 0.18.1
 
-4. **Differentiated handling by request type**:
-   - Warm requests (22%, >90% cache hit, avg 1.3k new tokens): nearly free, any instance can serve them
-   - Cold requests (55%, 0% cache hit, avg 17.7k new tokens): prefill-heavy, need sufficient compute
-   - Request-type-aware routing could optimize this further
+| File | Change | Reason |
+|------|--------|--------|
+| `v1/core/sched/scheduler.py` | `assert req_id in self.requests` → graceful skip | KV transfer callback races with request abort |
+
+---
+
+## Appendix: Experiment Artifacts
+
+### Data on dash0 (`~/agentic-kv/outputs/`)
+
+| Directory | Config | Requests | Notes |
+|-----------|--------|----------|-------|
+| `v18_combined_1000req` | TP=8 DP=1, 16 sess, 120s TO | 1000 | Baseline with /metrics APC |
+| `exp1_combined_tp2_dp4` | TP=2 DP=4, RR, 8 sess | 999 | No summary (killed) |
+| `exp2_combined_tp1_dp8` | TP=1 DP=8, cache-aware, 8 sess | 999 | Unified scheduler baseline |
+| `exp3_pd_sep_tp1_mooncake` | TP=1 4P+4D Mooncake, cache-aware | ~560 | Multiple iterations |
+| `gpu_ab_combined` | TP=1 DP=8 cache-aware, 200 req | 200 | GPU util CSV + metrics |
+| `gpu_ab_pdsep` | TP=1 4P+4D cache-aware, 200 req | 200 | GPU util CSV + metrics |
+| `gpu_ab_6p2d` | TP=1 6P+2D cache-aware, 200 req | 200 | Ablation 1: P/D ratio |
+| `gpu_ab_6p2d_fnf` | TP=1 6P+2D fire-and-forget, 200 req | 67 | Ablation 2: scheduling |
+| `breakdown_await` | TP=1 6P+2D await, 50 req | 50 | Per-stage breakdown |
+
+### Trace on dash0
+
+| Path | Description |
+|------|-------------|
+| `~/ali-trace/trace-glm5.1/` | Raw production logs (301GB, 4 files × 30min) |
+| `~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl` | Formatted 2h trace (2.1M requests) |
+| `~/agentic-kv/traces/sampled_1000req_seed42.jsonl` | Sampled 1000 requests for benchmarks |
+
+### Key Scripts
+
+| Script | Purpose |
+|--------|---------|
+| `scripts/cache_aware_proxy.py` | Unified global scheduler (combined + PD-sep modes) |
+| `scripts/sample_trace.py` | Trace sampler preserving sessions + hash_ids |
+| `replayer/` | Async trace replayer with streaming metrics |
+| `scripts/compute_roofline.py` | Prefill/decode roofline analysis |
+| `scripts/analyze_cache_hit.py` | Theoretical vs actual KV cache hit ratio |
+| `scripts/analyze_breakdown.py` | Per-request stage breakdown from proxy |
+| `scripts/gpu_monitor.sh` | 5s-interval GPU utilization sampling |
+
+### Reproducing
+
+```bash
+# On dash0, activate env
+cd ~/agentic-kv && source .venv/bin/activate
+
+# Sample trace
+python scripts/sample_trace.py --input ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
+    --output traces/sampled_1000req_seed42.jsonl --target-requests 1000 --seed 42
+
+# Combined TP=1 DP=8 + cache-aware scheduler
+for i in $(seq 0 7); do
+    MASTER_PORT=$((29500+i)) CUDA_VISIBLE_DEVICES=$i vllm serve "$MODEL" \
+        --port $((8000+i)) --tensor-parallel-size 1 --enable-prefix-caching --enforce-eager &
+done
+python scripts/cache_aware_proxy.py --combined http://127.0.0.1:800{0..7} --port 9090
+python -m replayer --trace traces/sampled_1000req_seed42.jsonl \
+    --endpoint http://localhost:9090 --time-scale 10 --max-inflight-sessions 8
+
+# Breakdown data
+curl http://localhost:9090/breakdown | python scripts/analyze_breakdown.py /dev/stdin
+```
diff --git a/analysis/roofline_analysis.md b/analysis/roofline_analysis.md
deleted file mode 100644
index bd8c065..0000000
--- a/analysis/roofline_analysis.md
+++ /dev/null
@@ -1,130 +0,0 @@
-# Compute/Memory Analysis of Prefill under High KV Cache Reuse
-
-## Model & GPU
-
-```
-Qwen3-Coder-30B-A3B (MoE 128E top-8)
-  48 layers, hidden=2048, heads=32, kv_heads=4 (GQA), head_dim=128
-  FFN: 6144 intermediate per expert, 8 experts active per token
-  Active params: ~3B per token
-
-H20 GPU: 148 TFLOPS (BF16) / 4.0 TB/s HBM
-  Ridge point: 37 FLOP/byte
-```
-
-## Key Finding: Prefill Is Still Compute-Bound Even at 95% Reuse
-
-```
-  SeqLen  Reuse%   NewTok   AI (F/B)   Bound       vs Decode AI
-   32,000      0%   32,000    23368     COMPUTE      18189x
-   32,000     70%    9,600    10045     COMPUTE       7819x
-   32,000     90%    3,200     3821     COMPUTE       2974x
-   32,000     95%    1,600     1980     COMPUTE       1541x
-
-   64,000      0%   64,000    40758     COMPUTE      26813x
-   64,000     70%   19,200    20610     COMPUTE      13559x
-   64,000     90%    6,400     8544     COMPUTE       5621x
-   64,000     95%    3,200     4549     COMPUTE       2993x
-
-  Decode (always):
-   32,000       -        1      1.3     MEMORY           1x
-   64,000       -        1      1.5     MEMORY           1x
-```
-
-**Key points**:
-- Decode arithmetic intensity (AI) = 1.0-1.9, far below the ridge point (37), so decode is always memory-bound
-- Even at 95% reuse (only 5% new tokens), prefill AI is still >1000, far above the ridge point, so prefill remains compute-bound
-
-## Why Is High-Reuse Prefill Still Compute-Bound?
-
-### Reason: Attention compute is proportional to seq_len
-
-With 95% cache reuse (seq_len=64k, new_tokens=3200):
-```
-  Q projection:   new_tokens × D × D     → only the 3200 new tokens are processed ✓
-  K,V projection: new_tokens × D × D_kv  → only the 3200 new tokens are processed ✓
-  
-  But attention score: new_tokens × seq_len × D_head × H × L
-                    = 3200 × 64000 × 128 × 32 × 48
-                    → attention must still be computed over the full 64k context!
-
-  FFN (MoE):       new_tokens × 3 × D × D_ffn × 2 × K_experts × L
-                  = 3200 × 3 × 2048 × 6144 × 2 × 8 × 48
-                  → the compute for 8 experts is still large
-```
-
-What KV cache reuse reduces:
-- K/V projection compute (new tokens only)
-- KV writes (new tokens only)
-
-What it **does not reduce**:
-- Attention of Q over the full context (every new Q attends to all 64k tokens)
-- MoE FFN compute (every new token activates 8 experts)
-
-So although prefill FLOPs fall with reuse, **what shrinks is the linear part (projections), not the quadratic part (attention)**.
-At long context the quadratic part dominates, so AI stays far above the ridge point even at 95% reuse.
-
-## When Does Prefill Actually Become Memory-Bound?
-
-```
-  SeqLen=32,000: at new_tokens ≈ 5-10 (reuse > 99.97%) → AI ≈ 37
-  SeqLen=64,000: at new_tokens ≈ 5-10 → AI ≈ 37
-```
-
-Only at **near-100% reuse** (just 5-10 new tokens) does prefill approach memory-bound.
-In the actual agentic trace, only 3% of requests reach this point.
-
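-## Implications for PD Separation: Correcting the Earlier Analysis
-
-### The earlier, incorrect conclusion (now corrected)
-> "Prefill is mostly cache lookup, not compute"
-
-This is **wrong**. Even at 70% cache reuse, prefill AI is still 7,000-14,000 times that of decode.
-Prefill is always compute-bound and decode is always memory-bound.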
-
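-### So why did PD separation not help in our experiments?
-
-The correct explanation is not that "prefill became memory-bound", but rather:
-
-**1. Cache reuse sharply reduces the absolute amount of prefill compute**
-```
-  No cache: avg 33.6k tokens × prefill compute = X FLOPs
-  71% cache: avg 9.4k tokens × prefill compute = 0.28X FLOPs
-```
-Prefill is still compute-bound, but **total work is only 28% of the original**.
-With 8 instances in parallel plus cache-aware routing, each instance's prefill load is very light
-and does not interfere noticeably with decode.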
-
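-**2. Per-token compute of the MoE model is small to begin with**
-Only 3B params are active (10% of all parameters), so per-token compute is modest.
-With a dense 70B model on the same GPUs, prefill-decode interference would be far more severe.
-
-**3. The "load-balancing" effect of cache-aware routing**
-When a request is routed to an instance with a high cache hit rate, that instance does less actual prefill work,
-which naturally reduces P-D contention. This amounts to "soft PD separation" at the routing layer.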
-
-## Roofline Characteristics across Workload Types
-
-```
-                          Prefill AI    Decode AI    PD-Sep value
-  Dense 70B, Chatbot:     200-1000x      1-2x        HIGH (compute-heavy P interferes with D)
-  Dense 70B, Agent:       100-500x       1-2x        MEDIUM (cache reduces P load)  
-  MoE 30B, Chatbot:       100-500x       1-2x        MEDIUM
-  MoE 30B, Agent:         50-200x        1-2x        LOW (small active params + cache)
-  ← our position
-```
-
-**The ROI of PD separation falls as the cache hit rate rises and as the model's active params shrink.**
-The agentic MoE model is unfavorable to PD separation on both counts.
-
-## Prefill Bound Distribution in the Actual Trace
-
-```
-  With actual trace prefix cache pattern (1000 sampled requests):
-    Compute-bound prefills: 961 (96%)
-    Memory-bound prefills:  37 (3%)    ← warm requests with near-100% reuse
-    (Decode is ALWAYS memory-bound)
-```
-
-96% of prefills are still compute-bound, but **absolute compute drops sharply because of caching**.
-This is a distinctive "compute-bound but lightweight" regime: the bound type is unchanged, but the load is much lighter.