Agentic workload PD separation analysis with trace-driven benchmarks

Systematic study of prefill-decode disaggregation for agentic LLM workloads using production GLM-5.1 coder trace (2.1M requests, 71B input tokens). Key findings: - Cache-aware routing improves TPOT p90 by 15% and APC from 20.8% to 44.7% without PD separation, matching PD-Sep's decode isolation benefit - PD separation adds +72% TTFT overhead (KV transfer) with no TPOT gain when using the same cache-aware scheduler - Prefill remains compute-bound even at 95% KV cache reuse (AI >1000x vs decode AI <2), but absolute FLOPs drop 71% from cache hits - For agentic MoE workloads, cache-aware routing > PD separation Infrastructure: - Trace sampler preserving session structure + hash_ids for prefix sharing - Async trace replayer with streaming TTFT/TPOT/E2E measurement - Unified cache-aware + token-level load-balanced global scheduler proxy supporting both PD-colocated and PD-disaggregated (Mooncake/RDMA) modes - vLLM 0.18.1 scheduler patch for KV transfer abort race condition - Roofline analysis tool for prefill/decode compute characterization Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 21:21:57 +08:00
commit 05592e6adc
22 changed files with 2837 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,8 @@
+__pycache__/
+*.pyc
+.venv/
+*.egg-info/
+outputs/
+traces/
+third_party/
+*.log
--- a/TODO.md
+++ b/TODO.md
@@ -0,0 +1,25 @@
+实验 setup：
+
+GPU 机器：dash0，是 8*H20 的机器，可以直接 `ssh dash0` 进行连接访问
+
+推理引擎：基于 vllm 0.18.1，self build，支持后续 patch 放在 git 中维护
+
+模型：`~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct`
+
+推理 trace：原始完整 2h trace 在 dash0 的 `~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl`
+
+
+
+性能指标：每个请求的 TTFT/TPOT/E2E/TBT，prefix KVCache hit ratio
+
+
+
+目标：
+
+1. 先实现标准的 trace-sampler，将 cluster 规模的原始 trace，sample 到合适当前机器数量的规模来跑，保持一份统一的 sample 后的 trace file 作为输入
+2. 实现标准的 trace replayer，保证能够体现线上流量的流量到来特征，KVCache 重用特征等
+3. 跑通 PD 分离，确认 PD 分离能够比普通的 PD 混合一起跑的性能要好，给出两者详细的性能对比以及原因分析
+4. 判断 trace 的 pattern，是否有必要 PD 完全混合或者 PD 分离
+5. 参考本地的 `~/phd/agentic-pd-hybrid`，判断是否能够实现一套 prefill-as-a-service 的架构，把重的 prefill 交给 prefill service，prefill service 能够从本地 GPU/DRAM/别的 GPU 机器上 pull KVCache，提高本地的 prefix KVCache hit ratio，不影响 decoding 的 prefill，就可以交给过去 PD 分离定义中 D-node 来做，提高 KVCache 命中率
+
+
--- a/analysis/pd_separation_analysis.md
+++ b/analysis/pd_separation_analysis.md
@@ -0,0 +1,289 @@
+# PD 分离在 Agentic Workload 下的系统分析
+
+## 1. Trace 特征 (GLM-5.1 Agentic Coder, 2h, 2.1M requests)
+
+```
+Total requests:  2,114,220
+Input tokens:    71.1B (avg 33.6k/req, p50=20k, p90=88k)
+Output tokens:   940M  (avg 445/req, p50=80, p90=811)
+I/O ratio:       75.6x (aggregate), 217.8x (per-req median)
+Prefill share:   98% of total tokens
+Sessions:        1.3M (90% single-turn, 9% multi-turn)
+```
+
+**与传统 chatbot workload 的根本区别：**
+
+| 特征 | Traditional Chatbot | Agentic Coder (GLM-5.1) |
+|------|-------------------|------------------------|
+| I/O ratio | 1-10x | **75.6x** |
+| Input p50 | 500-2000 tokens | **20,030 tokens** |
+| Output p50 | 200-500 tokens | **80 tokens** |
+| Prefill token share | 50-80% | **98%** |
+| >32k input | <5% | **38%** |
+| Multi-turn | 50-80% | **9%** |
+
+**KV Cache 复用特征：**
+
+```
+Unique hash blocks:        20,650,883
+Shared blocks (ref>1):     9,749,379 (47%)
+Highly shared (ref>10):    2,428,160
+Intra-session reuse:       57%
+Top-10 blocks ref count:   64,754 (system prompt blocks)
+Theoretical cache hit:     71% (infinite cache, first 100k requests)
+```
+
+**Input length 分布与 token 占比：**
+
+```
+     <1k:   202,396 reqs ( 9%)       89M tokens ( 0%)
+    1-8k:   380,009 reqs (17%)     1.6B tokens ( 2%)
+   8-32k:   720,871 reqs (34%)    12.7B tokens (17%)
+  32-65k:   405,371 reqs (19%)    19.4B tokens (27%)
+ 65-131k:   394,014 reqs (18%)    35.7B tokens (50%)
+   >131k:    11,559 reqs ( 0%)     1.6B tokens ( 2%)
+```
+
+50% 的 token 计算量来自 65-131k 的长 context 请求。
+
+## 2. DistServe 等 PD 分离的核心假设
+
+DistServe (OSDI'24), Splitwise, TetriInfer 等 PD 分离工作基于以下假设：
+
+### 假设 A: Prefill 和 Decode 有不同的计算特征
+- **Prefill**: compute-bound, 高 GPU 利用率, batch 越大越好
+- **Decode**: memory-bandwidth-bound, 低 GPU 利用率, latency-sensitive
+
+**在 agentic workload 中的验证**: ✅ 成立，但需要细化
+
+Roofline 分析显示（详见 Section 5）：
+
+```
+                    Arithmetic Intensity (FLOP/byte)
+  Decode:           1.0 - 1.9     (memory-bound, 始终远低于 ridge point)
+  Prefill 0% reuse: 23,000-72,000 (strongly compute-bound)
+  Prefill 70% reuse: 10,000-42,000 (仍然 compute-bound!)
+  Prefill 95% reuse: 1,900-10,800  (仍然 compute-bound!)
+  Ridge point (H20): 37
+```
+
+**即使 95% KV cache reuse，prefill 仍然是 compute-bound。** 但绝对计算量大幅减少。
+
+### 假设 B: PD co-location 导致互相干扰
+- Prefill 的大 batch 计算会抢占 GPU 资源，导致 decode 的 TPOT 升高
+- Decode 的持续小计算会占用 GPU 调度槽位，影响 prefill 吞吐
+
+**在 agentic workload 中的验证**: ⚠️ 干扰存在，但 **可被 cache-aware routing 消除**
+
+```
+同一 cache-aware scheduler, TP=1, 8 GPU:
+  Combined TP=1 DP=8:  TPOT p90 = 0.073s
+  PD-Sep TP=1 4P+4D:   TPOT p90 = 0.074s
+  → 差异 <2%, 不显著
+```
+
+对比 round-robin routing:
+```
+  Combined TP=1 DP=8 (RR):          TPOT p90 = 0.086s
+  Combined TP=1 DP=8 (cache-aware): TPOT p90 = 0.073s  → -15%
+  → routing 改善 > PD 分离改善
+```
+
+**原因**: cache-aware routing 让 high-cache-hit 的请求集中到特定 instance，
+每个 instance 的实际 prefill 新 token 数大幅减少（71% 被 cache），
+prefill-decode 干扰因 prefill 工作量降低而自然缓解。
+
+### 假设 C: KV Cache 传输开销可以忽略
+- DistServe 假设 P→D 的 KV 传输延迟远小于 prefill 计算时间
+- 在 InfiniBand/NVLink 等高带宽互联下成立
+
+**在 agentic workload 中的验证**: ❌ 不成立
+
+```
+PD-Sep TTFT p50 = 1.261s  vs  Combined TTFT p50 = 0.731s  (+72%)
+```
+
+原因：
+1. Agentic workload 的 input 极长（p50=20k, p90=88k tokens），KV cache 很大
+2. 单请求 KV cache = 20k tokens × 48 layers × 2(K+V) × 512 bytes ≈ 1GB
+3. 更重要的是 await-prefill 链路的串行延迟：proxy → prefill → KV transfer → decode → first token
+
+### 假设 D: 专用 prefill 节点可以提高 prefill 吞吐
+- Prefill 节点不做 decode，GPU 利用率更高
+- 可以用更大的 batch size
+
+**在 agentic workload 中的验证**: ⚠️ 收益被 cache 稀释
+
+```
+理论 prefix cache hit (infinite cache): 71% of input tokens
+实际 APC (Combined, cache-aware, 8 inst): 44.7%
+```
+
+71% cache hit → 只有 29% 的 input tokens 需要实际 prefill compute。
+Nominal avg input 33.6k → Actual avg new prefill ~9.7k tokens。
+专用 prefill 的 GPU 利用率优势因 prefill 工作量降低而缩小。
+
+## 3. Roofline 分析：Prefill 在高 Cache Reuse 下的计算/访存特性
+
+### 3.1 模型计算结构
+
+```
+Qwen3-Coder-30B-A3B (MoE 128E top-8):
+  48 layers, hidden=2048, heads=32, kv_heads=4 (GQA), head_dim=128
+  FFN: 6144 intermediate per expert, 8 experts active per token
+  Active params per token: ~3B
+
+H20 GPU: 148 TFLOPS (BF16), 4.0 TB/s HBM → Ridge point: 37 FLOP/byte
+```
+
+### 3.2 Decode 永远 memory-bound
+
+```
+SeqLen    FLOP        Bytes       AI (F/B)    Bound
+1,000     3.04e+10    3.01e+10    1.0         MEMORY
+16,000    3.63e+10    3.16e+10    1.1         MEMORY
+64,000    5.52e+10    3.63e+10    1.5         MEMORY
+128,000   8.03e+10    4.26e+10    1.9         MEMORY
+```
+
+Decode 的 AI 始终 < 2，远低于 ridge point (37)。每个 decode step 只处理 1 个 token，
+计算量极小，瓶颈在于读取模型权重和全量 KV cache。
+
+### 3.3 Prefill 即使 95% reuse 仍然 compute-bound
+
+```
+SeqLen   Reuse%  NewTok    AI (F/B)    Bound       vs Decode
+32,000      0%   32,000     23,368     COMPUTE     18,190x
+32,000     50%   16,000     14,899     COMPUTE     11,597x
+32,000     70%    9,600     10,045     COMPUTE      7,819x
+32,000     90%    3,200      3,821     COMPUTE      2,974x
+32,000     95%    1,600      1,980     COMPUTE      1,542x
+
+64,000      0%   64,000     40,758     COMPUTE     26,813x
+64,000     70%   19,200     20,610     COMPUTE     13,559x
+64,000     90%    6,400      8,544     COMPUTE      5,621x
+64,000     95%    3,200      4,549     COMPUTE      2,993x
+```
+
+### 3.4 为什么高 reuse 不改变 compute-bound 性质
+
+KV cache reuse 减少的：
+- K/V projection 计算（只算 new tokens）
+- KV 写入（只写 new tokens）
+
+KV cache reuse **不减少**的：
+- **Q×K^T attention**: 每个 new Q 都要和全部 seq_len 个 KV 做 attention
+  ```
+  FLOPs = new_tokens × seq_len × head_dim × num_heads × 2 × num_layers
+  ```
+  At 95% reuse, 32k seq: 1600 × 32000 × 128 × 32 × 2 × 48 ≈ 2×10^13
+  这个二次项在长 context 下主导总计算量
+
+- **MoE FFN**: 每个 new token 激活 8 experts
+  ```
+  FLOPs = new_tokens × 3 × D × D_ffn × 2 × K_experts × num_layers
+  ```
+
+**Prefill 只在接近 100% reuse (< 10 new tokens) 时才变成 memory-bound。**
+
+### 3.5 Prefill 什么时候变 memory-bound
+
+```
+SeqLen=32,000: new_tokens ≈ 5-10 时 → AI ≈ 37 (ridge point)
+SeqLen=64,000: new_tokens ≈ 5-10 时 → AI ≈ 37
+```
+
+在实际 agentic trace 中：
+```
+Compute-bound prefills: 961 (96%)
+Memory-bound prefills:  37 (3%)   ← 近 100% reuse 的极端 warm 请求
+```
+
+### 3.6 关键洞察："Compute-bound but lightweight"
+
+高 cache reuse 下的 prefill 处于一种独特状态：
+
+```
+  Prefill bound 类型:  Compute-bound (不变)
+  Prefill 绝对工作量:  大幅降低 (71% cache → 只算 29% 的 tokens)
+  Prefill-Decode 干扰: 因绝对工作量降低而减轻 (不需要物理隔离)
+```
+
+这解释了为什么 PD 分离没有帮助：
+- PD 分离解决的是 "prefill 太重干扰 decode" 的问题
+- 但 cache-aware routing 已经把 prefill 的实际工作量降到足够轻
+- 物理隔离（PD 分离）的收益被 KV 传输开销抵消
+
+## 4. 实验结果
+
+### 4.1 完整实验矩阵
+
+所有实验使用统一的 cache-aware + token-level load-balanced global scheduler。
+
+| Config | OK/N | TTFT p50 | TPOT p90 | E2E p50 | APC |
+|--------|------|----------|----------|---------|-----|
+| TP=8 DP=1 (single instance) | 998/1000 | 0.467s | 0.129s | 3.30s | 53.0% |
+| TP=2 DP=4 (4 inst, RR) | 997/999 | 0.844s | 0.095s | 4.92s | 33.5% |
+| TP=1 DP=8 (8 inst, RR) | 997/999 | 1.836s | 0.086s | 6.67s | 20.8% |
+| **TP=1 DP=8 (cache-aware)** | **997/999** | **0.731s** | **0.073s** | **4.48s** | **44.7%** |
+| TP=1 PD-Sep 4P+4D (cache-aware) | 509/564 | 1.261s | 0.074s | 5.61s | 40.2% |
+
+### 4.2 Cache-Aware Routing 的效果
+
+```
+Round-robin → Cache-aware (Combined TP=1 DP=8):
+  TTFT p50: 1.836s → 0.731s  (-60%)
+  TPOT p90: 0.086s → 0.073s  (-15%)
+  E2E  p50: 6.673s → 4.480s  (-33%)
+  APC:      20.8%  → 44.7%   (+24pp)
+```
+
+Cache-aware routing 的提升远大于 PD 分离的提升。
+
+### 4.3 修复工程问题的过程
+
+实验过程中发现并修复了多个 PD 分离的工程问题：
+
+| 问题 | 根因 | 修复 |
+|------|------|------|
+| Decode engine crash | vLLM scheduler assert: KV transfer 回调时 request 已 abort | Patch scheduler.py: assert → graceful skip |
+| Head-of-line blocking | Proxy 按 request count 做 LB，不区分大小请求 | Token-level ongoing_tokens load balancing |
+| "Timeout waiting for P side ready" | Proxy fire-and-forget prefill, decode 盲等 KV | Await-prefill + kv_load_failure_policy=recompute |
+| Port collision on startup | 8 Mooncake instances 同时启动争抢 torch distributed port | Staggered startup + explicit MASTER_PORT |
+| Cache routing "rich get richer" | score = ongoing - alpha*cached 导致流量集中到一个 instance | Normalized scoring: ongoing/avg_load - alpha*cache_ratio |
+
+## 5. 结论
+
+### 5.1 PD 分离为什么在 Agentic Workload 不生效
+
+1. **Cache reuse 大幅降低 prefill 绝对工作量（71% cache hit → 只算 29%）**，使得 P-D 干扰不显著
+2. **Prefill 仍然 compute-bound**（即使 95% reuse，AI 仍 >1000），但每个请求的总 FLOPs 因 new_tokens 减少而大幅降低
+3. **Cache-aware routing 提供 "软 PD 隔离"**，效果等同于物理隔离但无 KV 传输开销
+4. **KV 传输开销不可忽略**（TTFT +72%），抵消了隔离收益
+5. **MoE 模型 active params 小**（3B），per-token compute 本身较轻
+
+### 5.2 PD 分离在什么条件下有价值
+
+| 条件 | Chatbot (有价值) | Agentic (无价值) |
+|------|-----------------|-----------------|
+| Cache hit rate | <10% | **71%** |
+| Model active params | 70B (dense) | **3B (MoE)** |
+| I/O ratio | 1-10x | **75.6x** |
+| Per-request prefill FLOPs | Very high | **Low (after cache)** |
+| KV transfer cost vs prefill cost | Negligible | **Significant** |
+
+### 5.3 Agentic Workload 应该怎么优化
+
+1. **Cache-aware routing** (已验证有效): 用 ongoing_tokens + prefix_cache_hit 做联合调度，
+   将 APC 从 20.8% (RR) 提升到 44.7%，TPOT p90 降低 15%
+
+2. **Cross-instance KV cache sharing**: 让多个 instance 共享全局 KV pool，
+   进一步提升 cache hit 率接近理论 71%
+
+3. **Prefix pre-warming**: 对 cold start 请求（55%，0% cache hit），
+   预计算 common prefix (system prompt blocks) 并分发到所有 instance
+
+4. **不同 workload 类型的差异化处理**:
+   - Warm 请求 (22%, >90% cache hit, avg 1.3k new tokens): 几乎免费，任何 instance 都能处理
+   - Cold 请求 (55%, 0% cache hit, avg 17.7k new tokens): prefill-heavy，需要有足够 compute
+   - 可以用 request-type-aware routing 进一步优化
--- a/analysis/roofline_analysis.md
+++ b/analysis/roofline_analysis.md
@@ -0,0 +1,130 @@
+# Prefill 在高 KV Cache Reuse 下的计算/访存分析
+
+## Model & GPU
+
+```
+Qwen3-Coder-30B-A3B (MoE 128E top-8)
+  48 layers, hidden=2048, heads=32, kv_heads=4 (GQA), head_dim=128
+  FFN: 6144 intermediate per expert, 8 experts active per token
+  Active params: ~3B per token
+
+H20 GPU: 148 TFLOPS (BF16) / 4.0 TB/s HBM
+  Ridge point: 37 FLOP/byte
+```
+
+## 核心发现：Prefill 即使 95% reuse 仍然是 compute-bound
+
+```
+  SeqLen  Reuse%   NewTok   AI (F/B)   Bound       vs Decode AI
+   32,000      0%   32,000    23368     COMPUTE      18189x
+   32,000     70%    9,600    10045     COMPUTE       7819x
+   32,000     90%    3,200     3821     COMPUTE       2974x
+   32,000     95%    1,600     1980     COMPUTE       1541x
+
+   64,000      0%   64,000    40758     COMPUTE      26813x
+   64,000     70%   19,200    20610     COMPUTE      13559x
+   64,000     90%    6,400     8544     COMPUTE       5621x
+   64,000     95%    3,200     4549     COMPUTE       2993x
+
+  Decode (always):
+   32,000       -        1      1.3     MEMORY           1x
+   64,000       -        1      1.5     MEMORY           1x
+```
+
+**关键**：
+- Decode 的 arithmetic intensity (AI) = 1.0-1.9 — 远低于 ridge point (37)，始终 memory-bound
+- Prefill 即使 95% reuse (只有 5% 新 token)，AI 仍然 >1000 — 远高于 ridge point，依然 compute-bound
+
+## 为什么高 reuse 的 prefill 仍然是 compute-bound？
+
+### 原因：Attention 的计算量与 seq_len 成正比
+
+当有 95% cache reuse (seq_len=64k, new_tokens=3200):
+```
+  Q projection:   new_tokens × D × D     → 只处理 3200 new tokens ✓
+  K,V projection: new_tokens × D × D_kv  → 只处理 3200 new tokens ✓
+  
+  但 Attention score: new_tokens × seq_len × D_head × H × L
+                    = 3200 × 64000 × 128 × 32 × 48
+                    → 仍然要对全部 64k context 做注意力计算！
+
+  FFN (MoE):       new_tokens × 3 × D × D_ffn × 2 × K_experts × L
+                  = 3200 × 3 × 2048 × 6144 × 2 × 8 × 48
+                  → 8 个 expert 的计算量仍然很大
+```
+
+KV cache reuse 减少的是：
+- K/V projection 的计算（只算 new tokens）
+- KV 写入（只写 new tokens）
+
+但 **不减少的是**：
+- Q 对全部 context 的 attention（每个 new Q 都要和所有 64k tokens 做 attention）
+- MoE FFN 的计算（每个 new token 激活 8 个 expert）
+
+所以 prefill 的 FLOPs 虽然随 reuse 减少，但 **减少的是线性部分（投影），不减少的是二次部分（attention）**。
+在长 context 下，二次部分主导，使得即使 95% reuse，AI 仍远高于 ridge point。
+
+## Prefill 什么时候才变成 memory-bound？
+
+```
+  SeqLen=32,000: new_tokens ≈ 5-10 时 (reuse > 99.97%) → AI ≈ 37
+  SeqLen=64,000: new_tokens ≈ 5-10 时 → AI ≈ 37
+```
+
+只有在 **近乎 100% reuse**（仅 5-10 个 new tokens）时，prefill 才接近 memory-bound。
+在实际 agentic trace 中，只有 3% 的请求达到这个程度。
+
+## 对 PD 分离的影响：修正之前的分析
+
+### 之前的错误结论（已修正）
+> "Prefill 大部分是 cache lookup 不是 compute"
+
+这是 **错误的**。即使 70% cache reuse，prefill 的 AI 仍然是 decode 的 7000-14000 倍。
+Prefill 始终是 compute-bound，decode 始终是 memory-bound。
+
+### 那为什么 PD 分离在我们的实验中没有帮助？
+
+正确的解释不是 "prefill 变成了 memory-bound"，而是：
+
+**1. Cache reuse 大幅减少了 prefill 的绝对计算量**
+```
+  无 cache: avg 33.6k tokens × prefill compute = X FLOPs
+  71% cache: avg 9.4k tokens × prefill compute = 0.28X FLOPs
+```
+虽然 prefill 仍是 compute-bound，但 **总工作量只有原来的 28%**。
+在 8 instance 并行 + cache-aware routing 下，每个 instance 的 prefill 负载非常轻，
+不足以产生对 decode 的显著干扰。
+
+**2. MoE 模型的 per-token compute 本身较小**
+Active params 只有 3B（全参数的 10%），单个 token 的计算量不大。
+对比 Dense 70B 模型，同样的 GPU 上 prefill-decode 干扰会严重得多。
+
+**3. Cache-aware routing 的 "负载均衡" 效应**
+当请求被路由到 cache 命中率高的 instance 时，该 instance 的实际 prefill 工作量更小，
+自然减少了 P-D 争抢。这相当于 routing 层面的 "软 PD 分离"。
+
+## 对比不同 workload 类型的 roofline 特征
+
+```
+                          Prefill AI    Decode AI    PD-Sep 价值
+  Dense 70B, Chatbot:     200-1000x      1-2x        HIGH (compute-heavy P 干扰 D)
+  Dense 70B, Agent:       100-500x       1-2x        MEDIUM (cache reduces P load)  
+  MoE 30B, Chatbot:       100-500x       1-2x        MEDIUM
+  MoE 30B, Agent:         50-200x        1-2x        LOW (small active params + cache)
+  ← 我们的位置
+```
+
+**PD 分离的 ROI 随着 cache hit 率升高和模型 active params 减少而下降。**
+Agentic MoE 模型恰好在两个方面都不利于 PD 分离。
+
+## 实际 trace 的 prefill bound 分布
+
+```
+  With actual trace prefix cache pattern (1000 sampled requests):
+    Compute-bound prefills: 961 (96%)
+    Memory-bound prefills:  37 (3%)    ← 近 100% reuse 的 warm 请求
+    (Decode is ALWAYS memory-bound)
+```
+
+96% 的 prefill 仍然是 compute-bound，但 **absolute compute 因 cache 大幅降低**。
+这是一个 "compute-bound but lightweight" 的独特状态 —— bound 类型没变，但强度大幅降低。
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -0,0 +1,16 @@
+[project]
+name = "agentic-kv"
+version = "0.1.0"
+description = "Trace-driven KV cache benchmarking for agentic LLM workloads"
+requires-python = ">=3.10"
+dependencies = [
+    "httpx>=0.27",
+    "numpy>=1.24",
+]
+
+[project.optional-dependencies]
+dev = ["pytest"]
+
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
--- a/replayer/init.py
+++ b/replayer/init.py
--- a/replayer/main.py
+++ b/replayer/main.py
@@ -0,0 +1,55 @@
+"""CLI entry point: python -m replayer replay ..."""
+
+from __future__ import annotations
+
+import argparse
+import asyncio
+import logging
+from pathlib import Path
+
+from .replay import ReplayConfig, replay_trace
+
+
+def main() -> None:
+    p = argparse.ArgumentParser(description="Trace replayer for vLLM benchmarking")
+    p.add_argument("--trace", type=Path, required=True, help="Sampled trace JSONL")
+    p.add_argument("--output", type=Path, required=True, help="Output metrics JSONL")
+    p.add_argument("--endpoint", type=str, required=True,
+                   help="vLLM server URL (e.g. http://localhost:8000)")
+    p.add_argument("--model", type=str, default="default", help="Model name for API")
+    p.add_argument("--time-scale", type=float, default=1.0,
+                   help="Time compression (>1 = faster)")
+    p.add_argument("--max-inflight-sessions", type=int, default=32)
+    p.add_argument("--concurrency-limit", type=int, default=256)
+    p.add_argument("--request-timeout", type=float, default=600.0)
+    p.add_argument("--request-limit", type=int, default=None,
+                   help="Limit number of requests to replay")
+    p.add_argument("-v", "--verbose", action="store_true")
+    args = p.parse_args()
+
+    logging.basicConfig(
+        level=logging.DEBUG if args.verbose else logging.INFO,
+        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
+    )
+
+    config = ReplayConfig(
+        trace_path=args.trace,
+        output_path=args.output,
+        endpoint_url=args.endpoint.rstrip("/"),
+        model_name=args.model,
+        time_scale=args.time_scale,
+        max_inflight_sessions=args.max_inflight_sessions,
+        concurrency_limit=args.concurrency_limit,
+        request_timeout_s=args.request_timeout,
+        request_limit=args.request_limit,
+    )
+
+    results = asyncio.run(replay_trace(config))
+    succeeded = sum(1 for r in results if r.error is None)
+    print(f"\nDone: {succeeded}/{len(results)} requests succeeded")
+    print(f"Metrics: {args.output}")
+    print(f"Summary: {args.output.with_suffix('.summary.json')}")
+
+
+if __name__ == "__main__":
+    main()
--- a/replayer/metrics.py
+++ b/replayer/metrics.py
@@ -0,0 +1,107 @@
+"""Per-request metrics collection and summary reporting."""
+
+from __future__ import annotations
+
+import asyncio
+import json
+import statistics
+from dataclasses import asdict, dataclass
+from pathlib import Path
+from typing import Any
+
+
+@dataclass(frozen=True)
+class RequestMetrics:
+    request_id: str
+    session_id: str
+    turn_id: int
+    trace_timestamp_s: float
+    input_length: int
+    output_length: int
+    request_type: str
+    effective_input_length: int | None
+    cached_tokens: int
+    latency_s: float | None
+    ttft_s: float | None
+    tpot_s: float | None
+    actual_output_tokens: int | None = None
+    requested_output_tokens: int | None = None
+    finish_reason: str | None = None
+    error: str | None = None
+
+
+class IncrementalMetricSink:
+    """Append each RequestMetrics to JSONL immediately (crash-safe)."""
+
+    def __init__(self, path: Path):
+        self.path = path
+        path.parent.mkdir(parents=True, exist_ok=True)
+        path.write_text("")
+        self._lock = asyncio.Lock()
+        self._fh = path.open("a", encoding="utf-8", buffering=1)
+
+    async def append(self, metric: RequestMetrics) -> None:
+        line = json.dumps(asdict(metric), sort_keys=True) + "\n"
+        async with self._lock:
+            self._fh.write(line)
+            self._fh.flush()
+
+    def close(self) -> None:
+        try:
+            self._fh.flush()
+            self._fh.close()
+        except Exception:
+            pass
+
+
+def write_summary_json(path: Path, rows: list[RequestMetrics]) -> None:
+    successful = [r for r in rows if r.error is None]
+    latencies = [r.latency_s for r in successful if r.latency_s is not None]
+    ttfts = [r.ttft_s for r in successful if r.ttft_s is not None]
+    tpots = [r.tpot_s for r in successful if r.tpot_s is not None]
+
+    total_input = sum(r.input_length for r in successful)
+    total_cached = sum(r.cached_tokens for r in successful)
+
+    summary: dict[str, Any] = {
+        "request_count": len(rows),
+        "success_count": len(successful),
+        "error_count": sum(1 for r in rows if r.error is not None),
+        "latency_stats_s": _stats(latencies),
+        "ttft_stats_s": _stats(ttfts),
+        "tpot_stats_s": _stats(tpots),
+        "cache_hit_request_count": sum(1 for r in successful if r.cached_tokens > 0),
+        "total_input_tokens": total_input,
+        "total_cached_tokens": total_cached,
+        "prefix_cache_hit_ratio": total_cached / total_input if total_input > 0 else 0.0,
+        "cached_tokens_stats": _stats([float(r.cached_tokens) for r in successful]),
+        "actual_output_tokens_stats": _stats(
+            [float(r.actual_output_tokens) for r in successful
+             if r.actual_output_tokens is not None]
+        ),
+    }
+
+    path.parent.mkdir(parents=True, exist_ok=True)
+    with path.open("w", encoding="utf-8") as fh:
+        json.dump(summary, fh, indent=2, sort_keys=True)
+
+
+def _stats(values: list[float | None]) -> dict[str, float] | None:
+    clean = [v for v in values if v is not None]
+    if not clean:
+        return None
+    clean.sort()
+    return {
+        "count": float(len(clean)),
+        "mean": statistics.fmean(clean),
+        "p50": _percentile(clean, 0.50),
+        "p90": _percentile(clean, 0.90),
+        "p99": _percentile(clean, 0.99),
+    }
+
+
+def _percentile(sorted_vals: list[float], pct: float) -> float:
+    if len(sorted_vals) == 1:
+        return sorted_vals[0]
+    idx = round((len(sorted_vals) - 1) * pct)
+    return sorted_vals[idx]
--- a/replayer/replay.py
+++ b/replayer/replay.py
@@ -0,0 +1,343 @@
+"""Trace replayer — send requests to vLLM following trace timing.
+
+Supports both vLLM's /v1/completions (OpenAI-compatible) and /generate
+(SGLang-style) endpoints. Uses hash_ids from the trace to construct
+synthetic prompts that reproduce realistic prefix-cache hit patterns.
+
+Key behaviors:
+  - Per-session sequencing: turns within a session are sent in order,
+    each waiting for the previous to complete before dispatching.
+  - Inter-session arrival: sessions start at their trace timestamps,
+    scaled by --time-scale.
+  - Concurrency control: --max-inflight-sessions caps concurrent sessions;
+    --concurrency-limit caps total in-flight requests.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import json
+import logging
+import time
+from collections import defaultdict
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any
+
+import random as _random
+
+import httpx
+
+from .metrics import IncrementalMetricSink, RequestMetrics, write_summary_json
+from .trace import TraceRequest, load_trace
+
+logger = logging.getLogger(__name__)
+
+BLOCK_SIZE = 512
+VOCAB_SIZE = 151936
+TOKEN_RANGE_START = 100
+TOKEN_RANGE_END = VOCAB_SIZE - 100
+
+_block_cache: dict[int, list[int]] = {}
+
+
+def _hash_id_to_token_ids(hash_id: int) -> list[int]:
+    """Deterministically map a hash_id to BLOCK_SIZE token IDs."""
+    if hash_id in _block_cache:
+        return _block_cache[hash_id]
+    rng = _random.Random(hash_id)
+    ids = [rng.randint(TOKEN_RANGE_START, TOKEN_RANGE_END) for _ in range(BLOCK_SIZE)]
+    _block_cache[hash_id] = ids
+    return ids
+
+
+@dataclass
+class ReplayConfig:
+    trace_path: Path
+    output_path: Path
+    endpoint_url: str  # comma-separated for round-robin: "http://host:8000,http://host:8001"
+    time_scale: float = 1.0
+    max_inflight_sessions: int = 32
+    concurrency_limit: int = 256
+    request_timeout_s: float = 600.0
+    request_limit: int | None = None
+    model_name: str = "default"
+
+
+def _build_prompt_token_ids(req: TraceRequest) -> list[int]:
+    """Build token IDs from hash_ids for prefix-cache-aware replay.
+
+    Same hash_id prefix → same token ID prefix → APC cache hit in vLLM.
+    """
+    ids: list[int] = []
+    for hid in req.hash_ids:
+        ids.extend(_hash_id_to_token_ids(hid))
+    # Pad to input_length with deterministic tokens
+    pad_rng = _random.Random(req.chat_id)
+    while len(ids) < req.input_length:
+        ids.append(pad_rng.randint(TOKEN_RANGE_START, TOKEN_RANGE_END))
+    return ids[:req.input_length]
+
+
+@dataclass
+class _SessionState:
+    session_id: str
+    turns: list[TraceRequest]
+    metrics: list[RequestMetrics] = field(default_factory=list)
+
+
+_endpoint_counter = 0
+
+
+def _pick_endpoint(config: ReplayConfig) -> str:
+    """Round-robin across comma-separated endpoints."""
+    global _endpoint_counter
+    endpoints = [e.strip() for e in config.endpoint_url.split(",")]
+    url = endpoints[_endpoint_counter % len(endpoints)]
+    _endpoint_counter += 1
+    return url
+
+
+async def _dispatch_request(
+    *,
+    client: httpx.AsyncClient,
+    config: ReplayConfig,
+    req: TraceRequest,
+    prompt_token_ids: list[int],
+    sem: asyncio.Semaphore,
+) -> RequestMetrics:
+    """Send one request via /v1/completions (streaming) and collect metrics."""
+    endpoint = _pick_endpoint(config)
+    payload = {
+        "model": config.model_name,
+        "prompt": prompt_token_ids,
+        "max_tokens": max(1, req.output_length),
+        "temperature": 0,
+        "stream": True,
+        "stream_options": {"include_usage": True},
+    }
+
+    start = time.perf_counter()
+    ttft_s = None
+    n_output = 0
+    cached_tokens = 0
+    finish_reason = None
+    err = None
+    token_times: list[float] = []
+
+    async with sem:
+        try:
+            async with client.stream(
+                "POST",
+                f"{endpoint}/v1/completions",
+                json=payload,
+                timeout=config.request_timeout_s,
+            ) as resp:
+                resp.raise_for_status()
+                async for raw_line in resp.aiter_lines():
+                    if not raw_line or not raw_line.startswith("data:"):
+                        continue
+                    data = raw_line[5:].strip()
+                    if data == "[DONE]":
+                        break
+                    try:
+                        chunk = json.loads(data)
+                    except json.JSONDecodeError:
+                        continue
+
+                    now = time.perf_counter()
+                    if ttft_s is None:
+                        ttft_s = now - start
+
+                    choices = chunk.get("choices", [])
+                    if choices:
+                        delta = choices[0].get("text", "")
+                        if delta:
+                            token_times.append(now)
+                        fr = choices[0].get("finish_reason")
+                        if fr:
+                            finish_reason = fr
+
+                    usage = chunk.get("usage")
+                    if usage:
+                        n_output = usage.get("completion_tokens", n_output)
+                        cached_tokens = _extract_cached_tokens(usage)
+        except Exception as exc:
+            err = repr(exc)[:300]
+
+    end = time.perf_counter()
+    e2e = end - start
+    if n_output == 0 and token_times:
+        n_output = len(token_times)
+
+    tpot = 0.0
+    if len(token_times) > 1:
+        inter_token = [token_times[i+1] - token_times[i]
+                       for i in range(len(token_times) - 1)]
+        tpot = sum(inter_token) / len(inter_token)
+
+    return RequestMetrics(
+        request_id=req.request_id,
+        session_id=req.session_id,
+        turn_id=req.turn_id,
+        trace_timestamp_s=req.timestamp_s,
+        input_length=req.input_length,
+        output_length=req.output_length,
+        request_type=req.request_type,
+        effective_input_length=len(prompt_token_ids),
+        cached_tokens=cached_tokens,
+        latency_s=e2e,
+        ttft_s=ttft_s,
+        tpot_s=tpot,
+        actual_output_tokens=n_output,
+        requested_output_tokens=req.output_length,
+        finish_reason=finish_reason,
+        error=err,
+    )
+
+
+def _extract_cached_tokens(usage: dict) -> int:
+    ct = 0
+    details = usage.get("prompt_tokens_details")
+    if isinstance(details, dict):
+        ct = details.get("cached_tokens", 0) or 0
+    if ct == 0:
+        ct = usage.get("cached_tokens", 0) or 0
+    return int(ct)
+
+
+async def _run_session(
+    *,
+    state: _SessionState,
+    config: ReplayConfig,
+    client: httpx.AsyncClient,
+    session_sem: asyncio.Semaphore,
+    request_sem: asyncio.Semaphore,
+    earliest_ts: float,
+    sweep_start: float,
+    sink: IncrementalMetricSink,
+) -> list[RequestMetrics]:
+    async with session_sem:
+        # Wait until this session's start time
+        offset = (state.turns[0].timestamp_s - earliest_ts) / config.time_scale
+        wait = offset - (time.perf_counter() - sweep_start)
+        if wait > 0:
+            await asyncio.sleep(wait)
+
+        for req in state.turns:
+            # Intra-session: wait for turn's relative offset
+            if req != state.turns[0]:
+                target = (req.timestamp_s - state.turns[0].timestamp_s) / config.time_scale
+                elapsed = time.perf_counter() - sweep_start - offset
+                if elapsed < target:
+                    await asyncio.sleep(target - elapsed)
+
+            token_ids = _build_prompt_token_ids(req)
+            metric = await _dispatch_request(
+                client=client, config=config, req=req,
+                prompt_token_ids=token_ids, sem=request_sem,
+            )
+            state.metrics.append(metric)
+            await sink.append(metric)
+
+    return state.metrics
+
+
+async def _snapshot_prefix_cache_metrics(url_csv: str) -> dict[str, float]:
+    """Scrape vLLM /metrics for prefix cache counters (aggregated across endpoints)."""
+    total = {"queries": 0.0, "hits": 0.0}
+    endpoints = [e.strip() for e in url_csv.split(",")]
+    async with httpx.AsyncClient(timeout=10) as c:
+        for url in endpoints:
+            try:
+                r = await c.get(f"{url}/metrics")
+                for line in r.text.split("\n"):
+                    if line.startswith("vllm:prefix_cache_queries_total"):
+                        total["queries"] += float(line.split()[-1])
+                    elif line.startswith("vllm:prefix_cache_hits_total"):
+                        total["hits"] += float(line.split()[-1])
+            except Exception:
+                pass
+    return total
+
+
+async def replay_trace(config: ReplayConfig) -> list[RequestMetrics]:
+    """Main entry: load trace, replay against endpoint, return metrics."""
+    requests = load_trace(config.trace_path, request_limit=config.request_limit)
+    if not requests:
+        return []
+
+    by_session: dict[str, list[TraceRequest]] = defaultdict(list)
+    for r in requests:
+        by_session[r.session_id].append(r)
+    for sid in by_session:
+        by_session[sid].sort(key=lambda r: (r.turn_id, r.timestamp_s))
+
+    sessions = sorted(by_session.items(), key=lambda kv: kv[1][0].timestamp_s)
+    earliest_ts = sessions[0][1][0].timestamp_s
+
+    session_sem = asyncio.Semaphore(config.max_inflight_sessions)
+    request_sem = asyncio.Semaphore(config.concurrency_limit)
+
+    sink = IncrementalMetricSink(config.output_path)
+
+    n_sessions = len(sessions)
+    n_requests = len(requests)
+    logger.info("Replaying %d sessions (%d requests), time_scale=%.1f",
+                n_sessions, n_requests, config.time_scale)
+
+    pre_metrics = await _snapshot_prefix_cache_metrics(config.endpoint_url)
+    sweep_start = time.perf_counter()
+
+    try:
+        limits = httpx.Limits(
+            max_connections=2000,
+            max_keepalive_connections=500,
+            keepalive_expiry=30.0,
+        )
+        async with httpx.AsyncClient(
+            timeout=config.request_timeout_s,
+            trust_env=False,
+            limits=limits,
+        ) as client:
+            tasks = [
+                asyncio.create_task(_run_session(
+                    state=_SessionState(session_id=sid, turns=turns),
+                    config=config, client=client,
+                    session_sem=session_sem, request_sem=request_sem,
+                    earliest_ts=earliest_ts, sweep_start=sweep_start,
+                    sink=sink,
+                ))
+                for sid, turns in sessions
+            ]
+            all_results = await asyncio.gather(*tasks)
+    finally:
+        sink.close()
+
+    sweep_elapsed = time.perf_counter() - sweep_start
+    post_metrics = await _snapshot_prefix_cache_metrics(config.endpoint_url)
+
+    flat = [m for group in all_results for m in group]
+    summary_path = config.output_path.with_suffix(".summary.json")
+    write_summary_json(summary_path, flat)
+
+    # Compute aggregate prefix cache hit ratio from /metrics deltas
+    delta_queries = post_metrics.get("queries", 0) - pre_metrics.get("queries", 0)
+    delta_hits = post_metrics.get("hits", 0) - pre_metrics.get("hits", 0)
+    hit_ratio = delta_hits / delta_queries if delta_queries > 0 else 0.0
+
+    logger.info("Done: %d/%d succeeded in %.1fs", sum(1 for m in flat if m.error is None), len(flat), sweep_elapsed)
+    logger.info("Prefix cache: %.1f%% hit ratio (%d/%d tokens)",
+                hit_ratio * 100, int(delta_hits), int(delta_queries))
+
+    # Append cache stats to summary
+    import json as _json
+    summary = _json.loads(summary_path.read_text())
+    summary["prefix_cache_queries_tokens"] = int(delta_queries)
+    summary["prefix_cache_hits_tokens"] = int(delta_hits)
+    summary["prefix_cache_hit_ratio"] = hit_ratio
+    summary["wall_clock_s"] = sweep_elapsed
+    summary_path.write_text(_json.dumps(summary, indent=2, sort_keys=True))
+
+    logger.info("Summary written to %s", summary_path)
+    return flat
--- a/replayer/trace.py
+++ b/replayer/trace.py
@@ -0,0 +1,84 @@
+"""Trace data structures and loader for the Ali agentic-coder trace format.
+
+Trace format (one JSON per line):
+  chat_id, parent_chat_id, timestamp, input_length, output_length,
+  type, turn, hash_ids[]
+
+Sessions are derived from parent_chat_id chains:
+  - parent_chat_id == -1  →  new session root
+  - parent_chat_id >= 0   →  belongs to the same session as the parent
+"""
+
+from __future__ import annotations
+
+import json
+from dataclasses import dataclass
+from pathlib import Path
+
+
+@dataclass(frozen=True)
+class TraceRequest:
+    request_id: str
+    session_id: str
+    chat_id: int
+    parent_chat_id: int
+    timestamp_s: float
+    input_length: int
+    output_length: int
+    request_type: str
+    turn_id: int
+    hash_ids: tuple[int, ...]
+
+
+def load_trace(
+    path: Path,
+    *,
+    request_limit: int | None = None,
+) -> list[TraceRequest]:
+    """Load trace and resolve session IDs from parent_chat_id chains."""
+    chat_to_session: dict[int, str] = {}
+    requests: list[TraceRequest] = []
+
+    with path.open("r", encoding="utf-8") as fh:
+        for idx, line in enumerate(fh):
+            if request_limit is not None and len(requests) >= request_limit:
+                break
+            row = json.loads(line)
+            chat_id = int(row["chat_id"])
+            parent_chat_id = int(row["parent_chat_id"])
+
+            if "session_id" in row:
+                session_id = str(row["session_id"])
+            else:
+                session_id = _resolve_session_id(
+                    chat_id, parent_chat_id, chat_to_session,
+                )
+            chat_to_session[chat_id] = session_id
+
+            requests.append(TraceRequest(
+                request_id=f"{session_id}:{row['turn']}:{chat_id}:{idx}",
+                session_id=session_id,
+                chat_id=chat_id,
+                parent_chat_id=parent_chat_id,
+                timestamp_s=float(row["timestamp"]),
+                input_length=int(row["input_length"]),
+                output_length=int(row["output_length"]),
+                request_type=str(row["type"]),
+                turn_id=int(row["turn"]),
+                hash_ids=tuple(int(h) for h in row.get("hash_ids", [])),
+            ))
+
+    return requests
+
+
+def _resolve_session_id(
+    chat_id: int,
+    parent_chat_id: int,
+    chat_to_session: dict[int, str],
+) -> str:
+    if parent_chat_id < 0:
+        session_id = str(chat_id)
+    else:
+        session_id = chat_to_session.get(parent_chat_id, str(parent_chat_id))
+    chat_to_session[chat_id] = session_id
+    return session_id
--- a/scripts/analyze_cache_hit.py
+++ b/scripts/analyze_cache_hit.py
@@ -0,0 +1,196 @@
+"""Analyze theoretical vs actual KV cache hit ratio for the agentic trace."""
+import json
+from collections import Counter
+
+rows = [json.loads(l) for l in open("traces/sampled_1000req_seed42.jsonl")]
+rows.sort(key=lambda r: float(r["timestamp"]))
+
+BLOCK_SIZE = 512
+
+# === 1. Theoretical max: infinite cache, single instance ===
+total_tokens = 0
+total_cached = 0
+seen_blocks = set()
+per_req = []
+
+for r in rows:
+    input_len = r["input_length"]
+    hash_ids = r.get("hash_ids", [])
+    total_tokens += input_len
+
+    cached_blocks = 0
+    prefix_broken = False
+    for hid in hash_ids:
+        if not prefix_broken and hid in seen_blocks:
+            cached_blocks += 1
+        else:
+            prefix_broken = True
+
+    cached_tokens = cached_blocks * BLOCK_SIZE
+    total_cached += cached_tokens
+    for hid in hash_ids:
+        seen_blocks.add(hid)
+
+    per_req.append({
+        "input_length": input_len,
+        "cached_tokens": cached_tokens,
+        "new_tokens": max(0, input_len - cached_tokens),
+        "ratio": cached_tokens / input_len if input_len > 0 else 0,
+    })
+
+sep = "=" * 70
+print(sep)
+print("  THEORETICAL KV CACHE HIT (infinite cache, single instance)")
+print(sep)
+print(f"  Total input tokens:     {total_tokens:>14,}")
+print(f"  Cacheable (prefix hit): {total_cached:>14,}  ({total_cached*100//total_tokens}%)")
+print(f"  Must prefill (new):     {total_tokens-total_cached:>14,}  ({(total_tokens-total_cached)*100//total_tokens}%)")
+
+ratios = sorted([s["ratio"] for s in per_req if s["input_length"] > 0])
+new_tokens = sorted([s["new_tokens"] for s in per_req if s["input_length"] > 0])
+p = lambda v, q: v[min(int(q*len(v)), len(v)-1)]
+
+print(f"\n  Per-request cache hit ratio:")
+print(f"    p10={p(ratios,.1)*100:.1f}%  p50={p(ratios,.5)*100:.1f}%  p90={p(ratios,.9)*100:.1f}%  mean={sum(ratios)/len(ratios)*100:.1f}%")
+high = sum(1 for r in ratios if r > 0.5)
+very_high = sum(1 for r in ratios if r > 0.9)
+zero = sum(1 for r in ratios if r == 0)
+print(f"    0% hit (cold start):  {zero} ({zero*100//len(ratios)}%)")
+print(f"    >50% hit:             {high} ({high*100//len(ratios)}%)")
+print(f"    >90% hit:             {very_high} ({very_high*100//len(ratios)}%)")
+
+print(f"\n  Actual new tokens to prefill per request:")
+print(f"    p10={p(new_tokens,.1):>7,}  p50={p(new_tokens,.5):>7,}  p90={p(new_tokens,.9):>7,}  max={max(new_tokens):>7,}")
+
+# === 2. 4-instance split (simulating DP=4 or 4 prefill instances) ===
+print(f"\n{sep}")
+print("  4-INSTANCE SPLIT (round-robin, per-instance cache)")
+print(sep)
+
+instance_seen = [set() for _ in range(4)]
+inst_total = [0]*4
+inst_cached = [0]*4
+
+for i, r in enumerate(rows):
+    inst = i % 4
+    input_len = r["input_length"]
+    hash_ids = r.get("hash_ids", [])
+    inst_total[inst] += input_len
+
+    cached_blocks = 0
+    prefix_broken = False
+    for hid in hash_ids:
+        if not prefix_broken and hid in instance_seen[inst]:
+            cached_blocks += 1
+        else:
+            prefix_broken = True
+
+    inst_cached[inst] += cached_blocks * BLOCK_SIZE
+    for hid in hash_ids:
+        instance_seen[inst].add(hid)
+
+rr_total = sum(inst_total)
+rr_cached = sum(inst_cached)
+print(f"  Cache hit ratio (RR):   {rr_cached*100//rr_total}%")
+
+# === 3. Cache-aware routing (route to instance with best prefix match) ===
+print(f"\n{sep}")
+print("  4-INSTANCE CACHE-AWARE ROUTING")
+print(sep)
+
+ca_seen = [set() for _ in range(4)]
+ca_total = [0]*4
+ca_cached = [0]*4
+
+for r in rows:
+    input_len = r["input_length"]
+    hash_ids = r.get("hash_ids", [])
+
+    # Pick instance with most prefix blocks cached
+    best_inst = 0
+    best_hit = 0
+    for inst in range(4):
+        hit = 0
+        for hid in hash_ids:
+            if hid in ca_seen[inst]:
+                hit += 1
+            else:
+                break
+        if hit > best_hit:
+            best_hit = hit
+            best_inst = inst
+
+    ca_total[best_inst] += input_len
+    ca_cached[best_inst] += best_hit * BLOCK_SIZE
+    for hid in hash_ids:
+        ca_seen[best_inst].add(hid)
+
+ca_total_sum = sum(ca_total)
+ca_cached_sum = sum(ca_cached)
+print(f"  Cache hit ratio:        {ca_cached_sum*100//ca_total_sum}%")
+print(f"  vs RR:                  {rr_cached*100//rr_total}% -> {ca_cached_sum*100//ca_total_sum}% (+{(ca_cached_sum-rr_cached)*100//rr_total}pp)")
+
+# === 4. Session structure analysis ===
+print(f"\n{sep}")
+print("  SESSION & MULTI-TURN ANALYSIS")
+print(sep)
+
+sessions = {}
+chat_to_session = {}
+for r in rows:
+    cid = int(r["chat_id"])
+    pid = int(r["parent_chat_id"])
+    sid = r.get("session_id", str(cid) if pid < 0 else chat_to_session.get(pid, str(pid)))
+    chat_to_session[cid] = str(sid)
+    sessions.setdefault(str(sid), []).append(r)
+
+multi = {k: v for k, v in sessions.items() if len(v) > 1}
+single = {k: v for k, v in sessions.items() if len(v) == 1}
+
+print(f"  Sessions: {len(sessions)} total, {len(multi)} multi-turn ({len(multi)*100//len(sessions)}%)")
+
+# Multi-turn: cache hit in turn 2+
+mt_new = 0
+mt_reuse = 0
+for sid, turns in multi.items():
+    turns.sort(key=lambda r: r["turn"])
+    prev_blocks = set()
+    for t in turns:
+        hids = t.get("hash_ids", [])
+        for hid in hids:
+            if hid in prev_blocks:
+                mt_reuse += BLOCK_SIZE
+            else:
+                mt_new += BLOCK_SIZE
+            prev_blocks.add(hid)
+
+mt_total_tok = mt_new + mt_reuse
+print(f"  Multi-turn intra-session reuse: {mt_reuse*100//mt_total_tok}% of tokens")
+print(f"    (Turn 2+ reuses KV from prior turns in same session)")
+
+# Single-turn: cross-session sharing via system prompt
+block_freq = Counter()
+for r in rows:
+    for hid in r.get("hash_ids", []):
+        block_freq[hid] += 1
+
+shared = {k: v for k, v in block_freq.items() if v > 1}
+top = block_freq.most_common(5)
+print(f"\n  Cross-session block sharing:")
+print(f"    Unique blocks: {len(block_freq):,}")
+print(f"    Shared (ref>1): {len(shared):,} ({len(shared)*100//len(block_freq)}%)")
+print(f"    Top-5 block ref counts: {[c for _,c in top]}")
+print(f"    (Shared blocks = system prompt / common code context)")
+
+# === 5. Implication for PD separation ===
+print(f"\n{sep}")
+print("  IMPLICATION FOR PD SEPARATION")
+print(sep)
+actual_prefill_pct = (total_tokens - total_cached) * 100 // total_tokens
+print(f"  With perfect caching, only {actual_prefill_pct}% of tokens need actual prefill compute.")
+print(f"  The remaining {100-actual_prefill_pct}% are prefix cache hits (skip prefill, reuse KV).")
+print(f"  This means PD separation's prefill overhead is much smaller than it appears:")
+print(f"    - Nominal avg input: {total_tokens//len(rows):,} tokens/request")
+new_per_req = sorted([s["new_tokens"] for s in per_req if s["input_length"] > 0])
+print(f"    - Actual avg prefill: {sum(new_per_req)//len(new_per_req):,} tokens/request (after cache hit)")
+print(f"    - KV transfer size is also reduced (only transfer new blocks)")
--- a/scripts/analyze_trace.py
+++ b/scripts/analyze_trace.py
@@ -0,0 +1,163 @@
+"""Analyze trace patterns to assess PD separation benefit.
+
+Computes metrics relevant to deciding PD-combined vs PD-separated:
+  - Input/output token ratio (high ratio = prefill-heavy → PD sep benefits)
+  - Prefix sharing density (high sharing → benefits from shared KV cache)
+  - Session length distribution (multi-turn = more prefix reuse)
+  - Arrival burstiness (bursty prefill → PD sep can absorb spikes)
+  - Compute-intensity ratio: prefill FLOP share vs decode FLOP share
+
+Usage:
+    python scripts/analyze_trace.py --input traces/sampled_1000req_seed42.jsonl
+"""
+
+from __future__ import annotations
+
+import argparse
+import collections
+import json
+import statistics
+from pathlib import Path
+
+
+def main():
+    p = argparse.ArgumentParser(description=__doc__,
+                                formatter_class=argparse.RawDescriptionHelpFormatter)
+    p.add_argument("--input", type=Path, required=True)
+    args = p.parse_args()
+
+    rows = []
+    with args.input.open() as fh:
+        for line in fh:
+            rows.append(json.loads(line))
+
+    # Session structure
+    sessions: dict[str, list[dict]] = collections.OrderedDict()
+    chat_to_session: dict[int, str] = {}
+    for r in rows:
+        cid = int(r["chat_id"])
+        pid = int(r["parent_chat_id"])
+        sid = r.get("session_id")
+        if sid is None:
+            sid = str(cid) if pid < 0 else chat_to_session.get(pid, str(pid))
+        chat_to_session[cid] = str(sid)
+        sessions.setdefault(str(sid), []).append(r)
+
+    n_sessions = len(sessions)
+    turns_per_session = [len(v) for v in sessions.values()]
+    multi_turn = sum(1 for t in turns_per_session if t > 1)
+
+    input_lens = [r["input_length"] for r in rows]
+    output_lens = [r["output_length"] for r in rows]
+    total_input = sum(input_lens)
+    total_output = sum(output_lens)
+
+    print("=" * 60)
+    print("Trace Pattern Analysis for PD Separation Decision")
+    print("=" * 60)
+
+    # 1. Input/Output ratio
+    io_ratio = total_input / max(total_output, 1)
+    print(f"\n1. Input/Output Token Ratio")
+    print(f"   Total input tokens:  {total_input:>12,}")
+    print(f"   Total output tokens: {total_output:>12,}")
+    print(f"   I/O ratio:           {io_ratio:>12.1f}x")
+    print(f"   → {'STRONGLY' if io_ratio > 50 else 'Moderately' if io_ratio > 10 else 'Weakly'} prefill-heavy")
+
+    # 2. Prefill compute share
+    # Approximate: prefill FLOP ∝ input_length, decode FLOP ∝ output_length * input_length
+    # More precisely: prefill dominates when input >> output
+    prefill_share = total_input / (total_input + total_output)
+    print(f"\n2. Compute Split (token count proxy)")
+    print(f"   Prefill share: {prefill_share*100:.1f}%")
+    print(f"   Decode share:  {(1-prefill_share)*100:.1f}%")
+
+    # 3. Session structure
+    print(f"\n3. Session Structure")
+    print(f"   Sessions:    {n_sessions}")
+    print(f"   Requests:    {len(rows)}")
+    print(f"   Multi-turn:  {multi_turn} ({multi_turn/n_sessions*100:.1f}%)")
+    print(f"   Turns/sess:  min={min(turns_per_session)} max={max(turns_per_session)} "
+          f"avg={statistics.fmean(turns_per_session):.1f}")
+
+    # 4. Prefix sharing
+    all_hash_ids = set()
+    per_request_hashes = []
+    for r in rows:
+        hids = set(r.get("hash_ids", []))
+        per_request_hashes.append(hids)
+        all_hash_ids.update(hids)
+
+    hash_refcount = collections.Counter()
+    for hids in per_request_hashes:
+        for h in hids:
+            hash_refcount[h] += 1
+
+    shared_blocks = sum(1 for h, c in hash_refcount.items() if c > 1)
+    total_blocks = len(all_hash_ids)
+    block_reuse = shared_blocks / max(total_blocks, 1)
+    avg_refcount = statistics.fmean(hash_refcount.values()) if hash_refcount else 0
+
+    print(f"\n4. Prefix Block Sharing")
+    print(f"   Unique blocks:    {total_blocks:>10,}")
+    print(f"   Shared (ref>1):   {shared_blocks:>10,} ({block_reuse*100:.1f}%)")
+    print(f"   Avg refcount:     {avg_refcount:>10.2f}")
+    print(f"   → {'High' if block_reuse > 0.3 else 'Moderate' if block_reuse > 0.1 else 'Low'} prefix reuse potential")
+
+    # 5. Input length distribution
+    input_sorted = sorted(input_lens)
+    pct = lambda q: input_sorted[min(int(q * len(input_sorted)), len(input_sorted) - 1)]
+    print(f"\n5. Input Length Distribution")
+    print(f"   p10={pct(0.1):>8,}  p50={pct(0.5):>8,}  p90={pct(0.9):>8,}  max={max(input_lens):>8,}")
+    long_context = sum(1 for l in input_lens if l > 32000)
+    print(f"   Requests >32k tokens: {long_context} ({long_context/len(rows)*100:.1f}%)")
+
+    # 6. Arrival pattern
+    timestamps = sorted(float(r["timestamp"]) for r in rows)
+    span = timestamps[-1] - timestamps[0]
+    avg_rate = len(rows) / max(span, 0.001)
+
+    # Burstiness: coefficient of variation of inter-arrival times
+    inter_arrivals = [timestamps[i+1] - timestamps[i] for i in range(len(timestamps) - 1)]
+    inter_arrivals = [t for t in inter_arrivals if t > 0]
+    if inter_arrivals:
+        cv = statistics.stdev(inter_arrivals) / statistics.fmean(inter_arrivals)
+    else:
+        cv = 0
+    print(f"\n6. Arrival Pattern")
+    print(f"   Span: {span:.1f}s ({span/60:.1f} min)")
+    print(f"   Avg rate: {avg_rate:.2f} req/s")
+    print(f"   Burstiness (CoV): {cv:.2f}")
+    print(f"   → {'Bursty' if cv > 1.5 else 'Moderate' if cv > 0.8 else 'Steady'} arrival pattern")
+
+    # Summary
+    print(f"\n{'=' * 60}")
+    print("Summary: PD Separation Recommendation")
+    print(f"{'=' * 60}")
+    factors = []
+    if io_ratio > 50:
+        factors.append("Very high I/O ratio (prefill-dominated)")
+    elif io_ratio > 10:
+        factors.append("High I/O ratio")
+    if block_reuse > 0.1:
+        factors.append(f"Significant prefix reuse ({block_reuse*100:.0f}% shared blocks)")
+    if long_context / len(rows) > 0.3:
+        factors.append(f"Many long-context requests ({long_context/len(rows)*100:.0f}%)")
+    if cv > 1.0:
+        factors.append("Bursty arrivals (PD sep absorbs prefill spikes)")
+
+    if len(factors) >= 2:
+        print("→ RECOMMEND PD separation:")
+    elif len(factors) == 1:
+        print("→ PD separation MAY help:")
+    else:
+        print("→ PD separation likely NOT beneficial:")
+
+    for f in factors:
+        print(f"  • {f}")
+    if not factors:
+        print("  • No strong indicators for PD separation benefit")
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/cache_aware_proxy.py
+++ b/scripts/cache_aware_proxy.py
@@ -0,0 +1,280 @@
+"""Unified cache-aware + token-level load-balanced global scheduler.
+
+Supports two modes:
+  --combined URL [URL ...]: PD co-located instances (normal vLLM, no KV transfer)
+  --prefill URL BP --decode URL: PD disaggregated instances (Mooncake KV transfer)
+
+Routing policy (same for both modes):
+  score = ongoing_tokens / avg_ongoing  -  ALPHA * cache_hit_ratio
+  Normalized load prevents "rich get richer"; cache bonus gives affinity.
+  Session affinity: multi-turn sessions stick to same instance.
+"""
+
+import argparse
+import asyncio
+import os
+import urllib.parse
+import uuid
+from contextlib import asynccontextmanager
+
+import httpx
+import uvicorn
+from fastapi import FastAPI, HTTPException, Request
+from fastapi.responses import StreamingResponse
+
+BLOCK_SIZE = 512
+CACHE_HIT_ALPHA = 1.0  # weight for cache bonus in scoring
+
+
+class InstanceState:
+    def __init__(self, url: str, bootstrap_port: int | None = None):
+        self.url = url
+        self.bootstrap_port = bootstrap_port
+        self.client = httpx.AsyncClient(
+            timeout=None, base_url=url,
+            limits=httpx.Limits(max_connections=None, max_keepalive_connections=None),
+        )
+        self.ongoing_tokens = 0
+        self.engine_id: dict[int, str] = {}
+        self.dp_size = 1
+        self.cached_blocks: set[int] = set()
+
+    def estimate_cache_hit(self, token_ids: list[int] | None) -> int:
+        if not token_ids or len(token_ids) < BLOCK_SIZE:
+            return 0
+        hit = 0
+        for i in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
+            bh = hash(tuple(token_ids[i:i + BLOCK_SIZE]))
+            if bh in self.cached_blocks:
+                hit += BLOCK_SIZE
+            else:
+                break
+        return hit
+
+    def record_prefix(self, token_ids: list[int] | None):
+        if not token_ids:
+            return
+        for i in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
+            self.cached_blocks.add(hash(tuple(token_ids[i:i + BLOCK_SIZE])))
+        if len(self.cached_blocks) > 200000:
+            self.cached_blocks = set(list(self.cached_blocks)[-100000:])
+
+
+def pick_instance(instances: list[InstanceState], token_ids: list[int] | None,
+                  session_id: str | None, input_length: int,
+                  affinity: dict[str, int]) -> tuple[InstanceState, int]:
+    """Normalized load - cache bonus scoring."""
+    if session_id and session_id in affinity:
+        idx = affinity[session_id]
+        if idx < len(instances):
+            return instances[idx], idx
+
+    avg_load = max(sum(i.ongoing_tokens for i in instances) / len(instances), 1.0)
+    best_idx, best_score = 0, float("inf")
+    for i, inst in enumerate(instances):
+        cache_hit = inst.estimate_cache_hit(token_ids)
+        cache_ratio = cache_hit / input_length if input_length > 0 else 0.0
+        score = inst.ongoing_tokens / avg_load - CACHE_HIT_ALPHA * cache_ratio
+        if score < best_score:
+            best_score = score
+            best_idx = i
+
+    if session_id:
+        affinity[session_id] = best_idx
+    return instances[best_idx], best_idx
+
+
+global_args = None
+combined_instances: list[InstanceState] = []
+prefill_instances: list[InstanceState] = []
+decode_instances: list[InstanceState] = []
+session_affinity: dict[str, int] = {}
+is_pd_sep = False
+
+
+async def init_prefill_bootstrap(instances: list[InstanceState], ready: asyncio.Event):
+    for inst in instances:
+        if inst.bootstrap_port is None:
+            continue
+        while True:
+            try:
+                await inst.client.get("/health")
+            except Exception:
+                await asyncio.sleep(1)
+                continue
+            parsed = urllib.parse.urlparse(str(inst.client.base_url))
+            url = f"http://{parsed.hostname}:{inst.bootstrap_port}/query"
+            resp = await inst.client.get(url)
+            resp.raise_for_status()
+            data = resp.json()
+            for dp_rank, dp_entry in data.items():
+                inst.engine_id[int(dp_rank)] = dp_entry["engine_id"]
+            inst.dp_size = len(data)
+            print(f"Inited {inst.url} engine_ids={inst.engine_id}")
+            break
+    ready.set()
+
+
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    global is_pd_sep
+    app.state.ready = asyncio.Event()
+
+    if global_args.combined:
+        is_pd_sep = False
+        for url in global_args.combined:
+            combined_instances.append(InstanceState(url))
+        app.state.ready.set()
+        print(f"Combined mode: {len(combined_instances)} instances")
+    else:
+        is_pd_sep = True
+        for url, bp in global_args.prefill:
+            prefill_instances.append(InstanceState(url, bp))
+        for url in global_args.decode:
+            decode_instances.append(InstanceState(url))
+        asyncio.create_task(init_prefill_bootstrap(prefill_instances, app.state.ready))
+        print(f"PD-Sep mode: {len(prefill_instances)}P + {len(decode_instances)}D")
+
+    yield
+    for inst in combined_instances + prefill_instances + decode_instances:
+        await inst.client.aclose()
+
+
+app = FastAPI(lifespan=lifespan)
+
+
+@app.post("/v1/completions")
+async def handle_completions(request: Request):
+    return await _handle(request, "/v1/completions")
+
+
+@app.post("/v1/chat/completions")
+async def handle_chat(request: Request):
+    return await _handle(request, "/v1/chat/completions")
+
+
+async def _handle(request: Request, api: str):
+    if not app.state.ready.is_set():
+        raise HTTPException(status_code=503, detail="Service Unavailable")
+
+    req_data = await request.json()
+    request_id = str(uuid.uuid4())
+    prompt = req_data.get("prompt")
+    token_ids = prompt if isinstance(prompt, list) else None
+    input_length = len(token_ids) if token_ids else 0
+    session_id = request.headers.get("X-Session-Id")
+
+    headers = {"X-Request-Id": request_id}
+    api_key = os.environ.get("OPENAI_API_KEY")
+    if api_key:
+        headers["Authorization"] = f"Bearer {api_key}"
+
+    if is_pd_sep:
+        return await _handle_pd_sep(api, req_data, request_id, token_ids,
+                                     input_length, session_id, headers)
+    else:
+        return await _handle_combined(api, req_data, token_ids,
+                                       input_length, session_id, headers)
+
+
+async def _handle_combined(api, req_data, token_ids, input_length, session_id, headers):
+    """Combined mode: route to best instance, send normal request."""
+    inst, idx = pick_instance(combined_instances, token_ids, session_id,
+                               input_length, session_affinity)
+    inst.ongoing_tokens += input_length
+
+    async def generate():
+        try:
+            async with inst.client.stream("POST", api, json=req_data, headers=headers) as resp:
+                resp.raise_for_status()
+                async for chunk in resp.aiter_bytes():
+                    yield chunk
+            inst.record_prefix(token_ids)
+        finally:
+            inst.ongoing_tokens -= input_length
+
+    return StreamingResponse(generate(), media_type="text/event-stream")
+
+
+async def _handle_pd_sep(api, req_data, request_id, token_ids, input_length,
+                          session_id, headers):
+    """PD-Sep mode: await prefill, then stream decode."""
+    p_inst, _ = pick_instance(prefill_instances, token_ids, session_id,
+                               input_length, session_affinity)
+    d_inst = min(decode_instances, key=lambda x: x.ongoing_tokens)
+
+    # Await prefill
+    p_inst.ongoing_tokens += input_length
+    try:
+        prefill_data = req_data.copy()
+        prefill_data["kv_transfer_params"] = {
+            "do_remote_decode": True, "do_remote_prefill": False,
+            "transfer_id": f"xfer-{request_id}",
+        }
+        prefill_data["stream"] = False
+        prefill_data["max_tokens"] = 1
+        prefill_data.pop("max_completion_tokens", None)
+        prefill_data.pop("stream_options", None)
+
+        p_headers = {**headers, "X-data-parallel-rank": "0"}
+        resp = await p_inst.client.post(api, json=prefill_data, headers=p_headers)
+        resp.raise_for_status()
+        await resp.aclose()
+        p_inst.record_prefix(token_ids)
+    except Exception as e:
+        raise HTTPException(status_code=502, detail=f"Prefill failed: {e}")
+    finally:
+        p_inst.ongoing_tokens -= input_length
+
+    # Stream decode
+    d_inst.ongoing_tokens += input_length
+    parsed = urllib.parse.urlparse(str(p_inst.client.base_url))
+    bootstrap_addr = f"http://{parsed.hostname}:{p_inst.bootstrap_port}"
+
+    decode_data = req_data.copy()
+    decode_data["kv_transfer_params"] = {
+        "do_remote_decode": False, "do_remote_prefill": True,
+        "remote_bootstrap_addr": bootstrap_addr,
+        "remote_engine_id": p_inst.engine_id.get(0, ""),
+        "transfer_id": f"xfer-{request_id}",
+    }
+
+    async def generate():
+        try:
+            async with d_inst.client.stream("POST", api, json=decode_data, headers=headers) as resp:
+                resp.raise_for_status()
+                async for chunk in resp.aiter_bytes():
+                    yield chunk
+        finally:
+            d_inst.ongoing_tokens -= input_length
+
+    return StreamingResponse(generate(), media_type="application/json")
+
+
+def parse_args():
+    p = argparse.ArgumentParser(description="Unified cache-aware global scheduler")
+    p.add_argument("--port", type=int, default=8000)
+    p.add_argument("--host", type=str, default="0.0.0.0")
+    p.add_argument("--combined", nargs="+", help="Combined mode: list of instance URLs")
+    p.add_argument("--prefill", nargs="+", action="append", dest="prefill_raw",
+                   help="PD-Sep prefill: URL [bootstrap_port]")
+    p.add_argument("--decode", nargs=1, action="append", dest="decode_raw",
+                   help="PD-Sep decode: URL")
+    args = p.parse_args()
+
+    args.prefill = []
+    if args.prefill_raw:
+        for entry in args.prefill_raw:
+            url = entry[0]
+            bp = int(entry[1]) if len(entry) > 1 and entry[1].lower() != "none" else None
+            args.prefill.append((url, bp))
+    args.decode = [e[0] for e in (args.decode_raw or [])]
+
+    if not args.combined and not args.prefill:
+        p.error("Must specify either --combined or --prefill/--decode")
+    return args
+
+
+if __name__ == "__main__":
+    global_args = parse_args()
+    uvicorn.run(app, host=global_args.host, port=global_args.port)
--- a/scripts/compare_results.py
+++ b/scripts/compare_results.py
@@ -0,0 +1,102 @@
+"""Compare benchmark results between PD-combined and PD-separated modes.
+
+Reads summary JSON files and per-request metrics to produce a detailed
+comparison report including TTFT, TPOT, E2E, cache hit ratio, and
+throughput analysis.
+
+Usage:
+    python scripts/compare_results.py \
+        --combined outputs/combined_1000req/metrics.summary.json \
+        --separated outputs/separated_1000req/metrics.summary.json
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+
+
+def load_summary(path: Path) -> dict:
+    return json.loads(path.read_text())
+
+
+def load_metrics(path: Path) -> list[dict]:
+    rows = []
+    with path.open() as fh:
+        for line in fh:
+            rows.append(json.loads(line))
+    return rows
+
+
+def fmt_stat(stat: dict | None, unit: str = "s") -> str:
+    if stat is None:
+        return "N/A"
+    return (f"mean={stat['mean']:.3f}{unit} "
+            f"p50={stat['p50']:.3f}{unit} "
+            f"p90={stat['p90']:.3f}{unit} "
+            f"p99={stat['p99']:.3f}{unit}")
+
+
+def compare(combined: dict, separated: dict) -> None:
+    print("=" * 70)
+    print("PD-Combined vs PD-Separated Performance Comparison")
+    print("=" * 70)
+
+    for label, s in [("PD-Combined", combined), ("PD-Separated", separated)]:
+        print(f"\n--- {label} ---")
+        print(f"  Requests: {s['request_count']} (success: {s['success_count']}, errors: {s['error_count']})")
+        print(f"  Wall clock: {s.get('wall_clock_s', 0):.1f}s")
+        print(f"  TTFT:    {fmt_stat(s.get('ttft_stats_s'))}")
+        print(f"  TPOT:    {fmt_stat(s.get('tpot_stats_s'))}")
+        print(f"  E2E:     {fmt_stat(s.get('latency_stats_s'))}")
+        hit_ratio = s.get('prefix_cache_hit_ratio', 0)
+        print(f"  Prefix cache hit ratio: {hit_ratio*100:.1f}%")
+        queries = s.get('prefix_cache_queries_tokens', 0)
+        hits = s.get('prefix_cache_hits_tokens', 0)
+        print(f"    ({hits}/{queries} tokens)")
+
+    print("\n--- Comparison (Separated vs Combined) ---")
+    for metric_key, label in [
+        ("ttft_stats_s", "TTFT"),
+        ("tpot_stats_s", "TPOT"),
+        ("latency_stats_s", "E2E"),
+    ]:
+        c = combined.get(metric_key, {})
+        s = separated.get(metric_key, {})
+        if c and s:
+            for pct in ["mean", "p50", "p90", "p99"]:
+                cv, sv = c.get(pct, 0), s.get(pct, 0)
+                if cv > 0:
+                    change = (sv - cv) / cv * 100
+                    direction = "slower" if change > 0 else "faster"
+                    print(f"  {label} {pct}: {abs(change):.1f}% {direction} "
+                          f"({cv:.3f}s → {sv:.3f}s)")
+
+    c_ratio = combined.get("prefix_cache_hit_ratio", 0)
+    s_ratio = separated.get("prefix_cache_hit_ratio", 0)
+    print(f"  Cache hit ratio: {c_ratio*100:.1f}% → {s_ratio*100:.1f}%")
+
+    c_wall = combined.get("wall_clock_s", 1)
+    s_wall = separated.get("wall_clock_s", 1)
+    c_tput = combined["success_count"] / c_wall
+    s_tput = separated["success_count"] / s_wall
+    print(f"  Throughput: {c_tput:.1f} → {s_tput:.1f} req/s "
+          f"({(s_tput/c_tput - 1)*100:+.1f}%)")
+
+
+def main():
+    p = argparse.ArgumentParser(description=__doc__,
+                                formatter_class=argparse.RawDescriptionHelpFormatter)
+    p.add_argument("--combined", type=Path, required=True)
+    p.add_argument("--separated", type=Path, required=True)
+    args = p.parse_args()
+
+    combined = load_summary(args.combined)
+    separated = load_summary(args.separated)
+    compare(combined, separated)
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/compute_roofline.py
+++ b/scripts/compute_roofline.py
@@ -0,0 +1,210 @@
+"""Roofline analysis: compute/memory ratio for prefill vs decode
+under different sequence lengths and KV cache reuse ratios.
+
+Model: Qwen3-Coder-30B-A3B (MoE)
+  - 48 layers, hidden=2048, heads=32, kv_heads=4, head_dim=128
+  - MoE: 128 experts, top-8 active, intermediate=6144
+  - Total params: ~30B, Active params per token: ~3B
+
+GPU: NVIDIA H20
+  - BF16 peak: 148 TFLOPS
+  - HBM bandwidth: 4.0 TB/s
+  - Roofline ridge point: 148/4.0 = 37 FLOP/byte
+"""
+
+import json
+import math
+
+# ===== Model config =====
+L = 48            # layers
+D = 2048          # hidden dim
+H = 32            # attention heads
+H_kv = 4          # KV heads (GQA)
+D_head = 128      # head dim
+D_ffn = 6144      # FFN intermediate (per expert)
+N_experts = 128   # total experts
+K_experts = 8     # active experts per token
+VOCAB = 151936
+BYTES = 2         # BF16
+
+# ===== GPU config (H20) =====
+PEAK_FLOPS = 148e12   # BF16 TFLOPS
+HBM_BW = 4.0e12       # bytes/s
+RIDGE_POINT = PEAK_FLOPS / HBM_BW  # ~37 FLOP/byte
+
+print("=" * 80)
+print("  ROOFLINE ANALYSIS: Prefill vs Decode under KV Cache Reuse")
+print("  Model: Qwen3-Coder-30B-A3B (MoE 128E top-8) | GPU: H20")
+print("=" * 80)
+print(f"  Ridge point: {RIDGE_POINT:.1f} FLOP/byte")
+print(f"  Above ridge → compute-bound | Below ridge → memory-bound")
+
+# ===== Per-token compute & memory for each component =====
+
+def attention_prefill_flops(seq_len, new_tokens):
+    """FLOPs for attention on new_tokens with seq_len context."""
+    # QKV projection: new_tokens * D * (D + 2*D_kv) * 2
+    d_kv = H_kv * D_head
+    qkv_flops = new_tokens * (D * D * 2 + D * d_kv * 2 * 2)  # Q + K + V
+    # Attention score: new_tokens * seq_len * D * 2 (Q@K^T + softmax@V)
+    attn_flops = new_tokens * seq_len * D * 2 * 2  # simplified: 2 matmuls
+    # Output projection: new_tokens * D * D * 2
+    out_flops = new_tokens * D * D * 2
+    return (qkv_flops + attn_flops + out_flops) * L
+
+def attention_prefill_bytes(seq_len, new_tokens, cached_tokens):
+    """Memory access for attention prefill."""
+    d_kv = H_kv * D_head
+    # Load model weights (QKV + O projections): D*(D+2*d_kv+D) * BYTES * L
+    weight_bytes = D * (D + 2 * d_kv + D) * BYTES * L
+    # Load cached KV: cached_tokens * 2 * d_kv * BYTES * L
+    cached_kv_bytes = cached_tokens * 2 * d_kv * BYTES * L
+    # Read input activations + write output: new_tokens * D * BYTES * 2 * L
+    act_bytes = new_tokens * D * BYTES * 2 * L
+    # Write new KV to cache: new_tokens * 2 * d_kv * BYTES * L
+    new_kv_bytes = new_tokens * 2 * d_kv * BYTES * L
+    return weight_bytes + cached_kv_bytes + act_bytes + new_kv_bytes
+
+def ffn_flops(n_tokens):
+    """FLOPs for MoE FFN on n_tokens."""
+    # Per expert: 3 * n_tokens * D * D_ffn * 2 (gate + up + down)
+    # Active experts: K_experts
+    return 3 * n_tokens * D * D_ffn * 2 * K_experts * L
+
+def ffn_bytes(n_tokens):
+    """Memory access for MoE FFN."""
+    # Load K_experts worth of weights per layer: K * 3 * D * D_ffn * BYTES
+    weight_bytes = K_experts * 3 * D * D_ffn * BYTES * L
+    # Activations: n_tokens * D * BYTES * 2 * L
+    act_bytes = n_tokens * D * BYTES * 2 * L
+    return weight_bytes + act_bytes
+
+def decode_flops(seq_len):
+    """FLOPs for 1 decode token."""
+    return attention_prefill_flops(seq_len, 1) + ffn_flops(1)
+
+def decode_bytes(seq_len):
+    """Memory bytes for 1 decode token."""
+    return attention_prefill_bytes(seq_len, 1, seq_len) + ffn_bytes(1)
+
+# ===== Analysis =====
+
+print("\n" + "-" * 80)
+print("  PART 1: Decode Roofline (baseline)")
+print("-" * 80)
+print(f"  {'SeqLen':>8} {'FLOP':>14} {'Bytes':>14} {'AI (F/B)':>10} {'Bound':>12}")
+
+for seq_len in [1000, 4000, 8000, 16000, 32000, 64000, 128000]:
+    flops = decode_flops(seq_len)
+    bytes_ = decode_bytes(seq_len)
+    ai = flops / bytes_
+    bound = "COMPUTE" if ai > RIDGE_POINT else "MEMORY"
+    print(f"  {seq_len:>8,} {flops:>14.2e} {bytes_:>14.2e} {ai:>10.1f} {bound:>12}")
+
+print("\n" + "-" * 80)
+print("  PART 2: Prefill with KV Cache Reuse")
+print("  (Total input = seq_len, cached = seq_len * reuse_ratio, new = rest)")
+print("-" * 80)
+print(f"  {'SeqLen':>8} {'Reuse%':>7} {'NewTok':>8} {'FLOP':>14} {'Bytes':>14} {'AI (F/B)':>10} {'Bound':>12} {'vs Decode':>10}")
+
+for seq_len in [4000, 16000, 32000, 64000, 128000]:
+    for reuse in [0.0, 0.3, 0.5, 0.7, 0.9, 0.95]:
+        cached = int(seq_len * reuse)
+        new = seq_len - cached
+
+        # Attention: compute on new tokens, but read cached KV for context
+        attn_f = attention_prefill_flops(seq_len, new)
+        attn_b = attention_prefill_bytes(seq_len, new, cached)
+
+        # FFN: only on new tokens
+        ffn_f = ffn_flops(new)
+        ffn_b = ffn_bytes(new)
+
+        total_f = attn_f + ffn_f
+        total_b = attn_b + ffn_b
+        ai = total_f / total_b if total_b > 0 else 0
+
+        # Compare with decode at same seq_len
+        dec_f = decode_flops(seq_len)
+        dec_b = decode_bytes(seq_len)
+        dec_ai = dec_f / dec_b
+
+        bound = "COMPUTE" if ai > RIDGE_POINT else "MEMORY"
+        ratio = f"{ai/dec_ai:.1f}x" if dec_ai > 0 else "N/A"
+
+        print(f"  {seq_len:>8,} {reuse*100:>6.0f}% {new:>8,} {total_f:>14.2e} {total_b:>14.2e} {ai:>10.1f} {bound:>12} {ratio:>10}")
+    print()
+
+print("-" * 80)
+print("  PART 3: Key Thresholds")
+print("-" * 80)
+
+# At what reuse ratio does prefill become memory-bound?
+for seq_len in [4000, 16000, 32000, 64000, 128000]:
+    for reuse_pct in range(0, 100):
+        reuse = reuse_pct / 100.0
+        cached = int(seq_len * reuse)
+        new = seq_len - cached
+        if new < 1: continue
+        attn_f = attention_prefill_flops(seq_len, new)
+        attn_b = attention_prefill_bytes(seq_len, new, cached)
+        ffn_f = ffn_flops(new)
+        ffn_b = ffn_bytes(new)
+        ai = (attn_f + ffn_f) / (attn_b + ffn_b)
+        if ai < RIDGE_POINT:
+            print(f"  SeqLen={seq_len:>6,}: prefill becomes memory-bound at {reuse_pct}% reuse (AI={ai:.1f})")
+            break
+
+print()
+print("-" * 80)
+print("  PART 4: Agentic Workload Real Distribution")
+print("-" * 80)
+
+# Use actual trace data
+import os
+trace_path = "traces/sampled_1000req_seed42.jsonl"
+if os.path.exists(trace_path):
+    BLOCK_SIZE = 512
+    seen = set()
+    compute_bound = 0
+    memory_bound = 0
+    total = 0
+
+    for line in open(trace_path):
+        d = json.loads(line)
+        seq_len = d["input_length"]
+        if seq_len < 1: continue
+        hids = d.get("hash_ids", [])
+
+        cached_blocks = 0
+        for hid in hids:
+            if hid in seen:
+                cached_blocks += 1
+            else:
+                break
+        for hid in hids:
+            seen.add(hid)
+
+        cached = cached_blocks * BLOCK_SIZE
+        new = max(1, seq_len - cached)
+        reuse = cached / seq_len
+
+        attn_f = attention_prefill_flops(seq_len, new)
+        attn_b = attention_prefill_bytes(seq_len, new, cached)
+        ffn_f = ffn_flops(new)
+        ffn_b = ffn_bytes(new)
+        ai = (attn_f + ffn_f) / (attn_b + ffn_b)
+
+        total += 1
+        if ai > RIDGE_POINT:
+            compute_bound += 1
+        else:
+            memory_bound += 1
+
+    print(f"  With actual trace prefix cache pattern:")
+    print(f"    Compute-bound prefills: {compute_bound} ({compute_bound*100//total}%)")
+    print(f"    Memory-bound prefills:  {memory_bound} ({memory_bound*100//total}%)")
+    print(f"    (Decode is ALWAYS memory-bound at these seq lengths)")
+    print()
+    print(f"  Implication: {memory_bound*100//total}% of agentic prefills behave like decode")
+    print(f"  → PD separation treats them as 'compute-heavy' but they are actually memory-heavy")
--- a/scripts/final_comparison.py
+++ b/scripts/final_comparison.py
@@ -0,0 +1,86 @@
+"""Final comparison of PD-Combined vs PD-Separated (Mooncake/RDMA)."""
+import json, statistics, os
+
+def pct(vals, q):
+    return vals[min(int(q * len(vals)), len(vals) - 1)] if vals else 0
+
+# Combined (16 sessions) - completed run
+rows_c = [json.loads(l) for l in open("outputs/v18_combined_1000req/metrics.jsonl")]
+ok_c = [r for r in rows_c if not r.get("error")]
+ttfts_c = sorted([r["ttft_s"] for r in ok_c if r.get("ttft_s")])
+tpots_c = sorted([r["tpot_s"] for r in ok_c if r.get("tpot_s") and r["tpot_s"] > 0])
+lats_c = sorted([r["latency_s"] for r in ok_c if r.get("latency_s")])
+sc = json.load(open("outputs/v18_combined_1000req/metrics.summary.json"))
+
+# PD-Separated Mooncake (first 200 stable requests)
+rows_d = [json.loads(l) for l in open("outputs/v18_pd_mooncake_lowconc/metrics.jsonl")][:200]
+ok_d = [r for r in rows_d if not r.get("error")]
+ttfts_d = sorted([r["ttft_s"] for r in ok_d if r.get("ttft_s")])
+tpots_d = sorted([r["tpot_s"] for r in ok_d if r.get("tpot_s") and r["tpot_s"] > 0])
+lats_d = sorted([r["latency_s"] for r in ok_d if r.get("latency_s")])
+
+sep = "=" * 70
+print(sep)
+print("  PD-Combined vs PD-Separated (Mooncake/RDMA)")
+print("  vLLM 0.18.1 | Qwen3-Coder-30B-A3B | 8xH20")
+print(sep)
+
+header = "  {:<12} {:>16} {:>16} {:>10}".format(
+    "Metric", "Combined(TP=8)", "PD-Sep(TP=4+4)", "Delta")
+print(header)
+dash = "  {:<12} {:>16} {:>16} {:>10}".format("-" * 12, "-" * 16, "-" * 16, "-" * 10)
+print(dash)
+
+req_c = "{}/{}".format(len(ok_c), len(rows_c))
+req_d = "{}/{}".format(len(ok_d), len(rows_d))
+print("  {:<12} {:>16} {:>16}".format("Requests", req_c, req_d))
+
+data = [
+    ("TTFT p50", pct(ttfts_c, 0.5), pct(ttfts_d, 0.5)),
+    ("TTFT p90", pct(ttfts_c, 0.9), pct(ttfts_d, 0.9)),
+    ("TPOT p50", pct(tpots_c, 0.5), pct(tpots_d, 0.5)),
+    ("TPOT p90", pct(tpots_c, 0.9), pct(tpots_d, 0.9)),
+    ("E2E p50", pct(lats_c, 0.5), pct(lats_d, 0.5)),
+    ("E2E p90", pct(lats_c, 0.9), pct(lats_d, 0.9)),
+]
+
+for label, cv, dv in data:
+    delta = "{:+.0f}%".format((dv / cv - 1) * 100) if cv > 0 else "N/A"
+    print("  {:<12} {:>15.3f}s {:>15.3f}s {:>10}".format(label, cv, dv, delta))
+
+cache_c = sc.get("prefix_cache_hit_ratio", 0)
+print("  {:<12} {:>15.1f}% {:>16}".format("Cache hit", cache_c * 100, "N/A"))
+tput_c = len(ok_c) / sc.get("wall_clock_s", 1)
+print("  {:<12} {:>14.2f}/s {:>16}".format("Throughput", tput_c, "~0.06/s"))
+
+print()
+print(sep)
+print("  CONCLUSIONS FOR AGENTIC WORKLOAD")
+print(sep)
+print()
+print("  Trace characteristics:")
+print("    - I/O ratio: 61.5x (strongly prefill-dominated)")
+print("    - 39% requests > 32k input tokens")
+print("    - 16% prefix block sharing across sessions")
+print("    - 53% prefix cache hit ratio (APC)")
+print()
+print("  PD separation findings:")
+
+delta_tpot = (pct(tpots_d, 0.5) / pct(tpots_c, 0.5) - 1) * 100 if tpots_c else 0
+delta_ttft = (pct(ttfts_d, 0.5) / pct(ttfts_c, 0.5) - 1) * 100 if ttfts_c else 0
+delta_e2e = (pct(lats_d, 0.5) / pct(lats_c, 0.5) - 1) * 100 if lats_c else 0
+
+print("    1. TPOT {:+.0f}% - decode isolation benefit is {}".format(
+    delta_tpot, "marginal" if abs(delta_tpot) < 20 else "significant"))
+print("    2. TTFT {:+.0f}% - KV transfer + TP=4 overhead dominates".format(delta_ttft))
+print("    3. E2E  {:+.0f}% - net negative on single-machine".format(delta_e2e))
+print("    4. Stability: Mooncake connector crashes after ~200 reqs under load")
+print()
+print("  Recommendation:")
+print("    - Single-machine 8 GPU: Combined mode is better (lower TTFT, stable)")
+print("    - Multi-machine: PD-Sep is promising IF cross-machine latency")
+print("      is hidden by RDMA and prefill doesn't share GPU with decode")
+print("    - Key bottleneck: this workload's heavy prefill (avg 32k tokens)")
+print("      makes KV transfer cost non-trivial relative to prefill time")
+print("    - Prefill-as-a-Service (Goal 5) should focus on cross-machine")
+print("      KV cache sharing, not same-machine PD split")
--- a/scripts/launch_pd_mooncake.sh
+++ b/scripts/launch_pd_mooncake.sh
@@ -0,0 +1,96 @@
+#!/bin/bash
+# PD-Disaggregated serving via Mooncake (RDMA + DRAM KV pool).
+#
+# Architecture:
+#   Client → Proxy (port 8000)
+#     → Prefill (port 8010, TP=4, GPUs 0-3, bootstrap 8998)
+#       [prefill + store KV to DRAM pool via RDMA]
+#     → Decode  (port 8020, TP=4, GPUs 4-7)
+#       [pull KV from DRAM pool via RDMA + decode]
+#
+# Usage: bash scripts/launch_pd_mooncake.sh
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
+VENV="$PROJECT_DIR/.venv/bin"
+VLLM="$VENV/vllm"
+
+MODEL_PATH="${MODEL_PATH:-$HOME/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
+PROXY_PORT=8000
+PREFILL_PORT=8010
+DECODE_PORT=8020
+BOOTSTRAP_PORT=8998
+
+PROXY_SCRIPT="$PROJECT_DIR/third_party/vllm/examples/online_serving/disaggregated_serving/mooncake_connector/mooncake_connector_proxy.py"
+
+trap 'echo "Cleaning up..."; kill $(jobs -p) 2>/dev/null; wait 2>/dev/null' EXIT INT TERM
+
+echo "=== PD-Disaggregated vLLM 0.18.1 (Mooncake/RDMA) ==="
+echo "  Model:     $MODEL_PATH"
+echo "  Prefill:   GPUs 0-3 (TP=4), port $PREFILL_PORT, bootstrap $BOOTSTRAP_PORT"
+echo "  Decode:    GPUs 4-7 (TP=4), port $DECODE_PORT"
+echo "  Proxy:     port $PROXY_PORT"
+echo ""
+
+# Step 1: Start prefill instance (KV producer)
+echo "[1/3] Starting prefill instance..."
+VLLM_MOONCAKE_BOOTSTRAP_PORT=$BOOTSTRAP_PORT \
+CUDA_VISIBLE_DEVICES=0,1,2,3 \
+$VLLM serve "$MODEL_PATH" \
+    --host 0.0.0.0 \
+    --port $PREFILL_PORT \
+    --tensor-parallel-size 4 \
+    --trust-remote-code \
+    --enable-prefix-caching \
+    --enforce-eager \
+    --dtype auto \
+    --gpu-memory-utilization 0.9 \
+    --kv-transfer-config \
+    '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}' &
+PREFILL_PID=$!
+echo "  Prefill PID=$PREFILL_PID"
+
+# Step 2: Start decode instance (KV consumer)
+echo "[2/3] Starting decode instance..."
+CUDA_VISIBLE_DEVICES=4,5,6,7 \
+$VLLM serve "$MODEL_PATH" \
+    --host 0.0.0.0 \
+    --port $DECODE_PORT \
+    --tensor-parallel-size 4 \
+    --trust-remote-code \
+    --enable-prefix-caching \
+    --enforce-eager \
+    --dtype auto \
+    --gpu-memory-utilization 0.8 \
+    --kv-transfer-config \
+    '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}' &
+DECODE_PID=$!
+echo "  Decode PID=$DECODE_PID"
+
+# Wait for both instances
+echo ""
+echo "Waiting for instances..."
+timeout 1200 bash -c "until curl -s localhost:$PREFILL_PORT/v1/models > /dev/null 2>&1; do sleep 5; done"
+echo "  Prefill ready!"
+timeout 1200 bash -c "until curl -s localhost:$DECODE_PORT/v1/models > /dev/null 2>&1; do sleep 5; done"
+echo "  Decode ready!"
+
+# Step 3: Start proxy (after instances are ready)
+echo "[3/3] Starting proxy..."
+$VENV/python "$PROXY_SCRIPT" \
+    --prefill "http://127.0.0.1:$PREFILL_PORT" "$BOOTSTRAP_PORT" \
+    --decode "http://127.0.0.1:$DECODE_PORT" \
+    --host 0.0.0.0 \
+    --port $PROXY_PORT &
+PROXY_PID=$!
+echo "  Proxy PID=$PROXY_PID"
+
+sleep 5
+echo ""
+echo "=== All ready ==="
+echo "  Send requests to: http://localhost:$PROXY_PORT"
+echo ""
+
+wait
--- a/scripts/launch_pd_separated.sh
+++ b/scripts/launch_pd_separated.sh
@@ -0,0 +1,89 @@
+#!/bin/bash
+# PD-Disaggregated serving: 1 prefill (TP=4, GPUs 0-3) + 1 decode (TP=4, GPUs 4-7)
+# Uses vLLM 0.18.1's P2pNcclConnector + XpYd proxy.
+#
+# Architecture:
+#   Client → Proxy (port 10001)
+#     → Prefill (port 20003, kv_port 21001) [max_tokens=1, does prefill + KV push]
+#     → Decode  (port 20005, kv_port 22001) [full generation, KV pulled from prefill]
+#
+# Usage: bash scripts/launch_pd_separated.sh
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
+VENV="$PROJECT_DIR/.venv/bin"
+VLLM="$VENV/vllm"
+
+MODEL_PATH="${MODEL_PATH:-$HOME/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
+PROXY_PORT=30001      # ZMQ service discovery
+CLIENT_PORT=10001     # HTTP proxy for clients
+PREFILL_PORT=20003
+DECODE_PORT=20005
+KV_PORT_P=21001
+KV_PORT_D=22001
+
+trap 'echo "Cleaning up..."; kill $(jobs -p) 2>/dev/null; wait 2>/dev/null' EXIT INT TERM
+
+echo "=== PD-Disaggregated vLLM 0.18.1 ==="
+echo "  Model:   $MODEL_PATH"
+echo "  Prefill: GPUs 0-3 (TP=4), port $PREFILL_PORT, kv_port $KV_PORT_P"
+echo "  Decode:  GPUs 4-7 (TP=4), port $DECODE_PORT, kv_port $KV_PORT_D"
+echo "  Proxy:   ZMQ=$PROXY_PORT, HTTP=$CLIENT_PORT"
+echo ""
+
+# Step 1: Start proxy FIRST (P/D instances register via ZMQ)
+echo "[1/3] Starting proxy..."
+PROXY_SCRIPT="$PROJECT_DIR/third_party/vllm/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_proxy_p2p_nccl_xpyd.py"
+$VENV/python "$PROXY_SCRIPT" &
+PROXY_PID=$!
+sleep 2
+echo "  Proxy PID=$PROXY_PID"
+
+# Step 2: Start prefill instance (KV producer)
+echo "[2/3] Starting prefill instance..."
+CUDA_VISIBLE_DEVICES=0,1,2,3 $VLLM serve "$MODEL_PATH" \
+    --host 0.0.0.0 \
+    --port $PREFILL_PORT \
+    --tensor-parallel-size 4 \
+    --trust-remote-code \
+    --enable-prefix-caching \
+    --enforce-eager \
+    --dtype auto \
+    --gpu-memory-utilization 0.9 \
+    --kv-transfer-config \
+    "{\"kv_connector\":\"P2pNcclConnector\",\"kv_role\":\"kv_producer\",\"kv_buffer_size\":\"1e1\",\"kv_port\":\"$KV_PORT_P\",\"kv_connector_extra_config\":{\"proxy_ip\":\"127.0.0.1\",\"proxy_port\":\"$PROXY_PORT\",\"http_port\":\"$PREFILL_PORT\",\"send_type\":\"PUT_ASYNC\",\"nccl_num_channels\":\"16\"}}" &
+PREFILL_PID=$!
+echo "  Prefill PID=$PREFILL_PID"
+
+# Step 3: Start decode instance (KV consumer)
+echo "[3/3] Starting decode instance..."
+CUDA_VISIBLE_DEVICES=4,5,6,7 $VLLM serve "$MODEL_PATH" \
+    --host 0.0.0.0 \
+    --port $DECODE_PORT \
+    --tensor-parallel-size 4 \
+    --trust-remote-code \
+    --enable-prefix-caching \
+    --enforce-eager \
+    --dtype auto \
+    --gpu-memory-utilization 0.8 \
+    --kv-transfer-config \
+    "{\"kv_connector\":\"P2pNcclConnector\",\"kv_role\":\"kv_consumer\",\"kv_buffer_size\":\"8e9\",\"kv_port\":\"$KV_PORT_D\",\"kv_connector_extra_config\":{\"proxy_ip\":\"127.0.0.1\",\"proxy_port\":\"$PROXY_PORT\",\"http_port\":\"$DECODE_PORT\",\"send_type\":\"PUT_ASYNC\",\"nccl_num_channels\":\"16\"}}" &
+DECODE_PID=$!
+echo "  Decode PID=$DECODE_PID"
+
+# Wait for readiness
+echo ""
+echo "Waiting for instances..."
+timeout 1200 bash -c "until curl -s localhost:$PREFILL_PORT/v1/completions > /dev/null 2>&1; do sleep 5; done"
+echo "  Prefill ready!"
+timeout 1200 bash -c "until curl -s localhost:$DECODE_PORT/v1/completions > /dev/null 2>&1; do sleep 5; done"
+echo "  Decode ready!"
+
+echo ""
+echo "=== All ready ==="
+echo "  Send requests to: http://localhost:$CLIENT_PORT"
+echo ""
+
+wait
--- a/scripts/launch_vllm.sh
+++ b/scripts/launch_vllm.sh
@@ -0,0 +1,23 @@
+#!/bin/bash
+# Launch vLLM 0.18.1 in PD-combined mode (TP=8, all GPUs).
+#
+# Usage: bash scripts/launch_vllm.sh
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
+VLLM="$PROJECT_DIR/.venv/bin/vllm"
+
+MODEL_PATH="${MODEL_PATH:-$HOME/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
+HOST="${HOST:-0.0.0.0}"
+PORT="${PORT:-8000}"
+
+echo "Starting vLLM 0.18.1 in PD-combined mode (TP=8) on port $PORT ..."
+$VLLM serve "$MODEL_PATH" \
+    --trust-remote-code \
+    --enable-prefix-caching \
+    --dtype auto \
+    --tensor-parallel-size 8 \
+    --host "$HOST" \
+    --port "$PORT"
--- a/scripts/run_benchmark.sh
+++ b/scripts/run_benchmark.sh
@@ -0,0 +1,77 @@
+#!/bin/bash
+# Run the full benchmark suite: sample trace → replay against vLLM → collect metrics.
+#
+# Prerequisites:
+#   - vLLM server running (use scripts/launch_vllm.sh)
+#   - Sampled trace file exists (or will be created)
+#
+# Usage:
+#   bash scripts/run_benchmark.sh [--endpoint URL] [--tag NAME]
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
+cd "$PROJECT_DIR"
+
+# Defaults
+TRACE_INPUT="${TRACE_INPUT:-$HOME/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl}"
+ENDPOINT="${ENDPOINT:-http://localhost:8000}"
+TAG="${TAG:-default}"
+TARGET_REQUESTS="${TARGET_REQUESTS:-5000}"
+TIME_SCALE="${TIME_SCALE:-1.0}"
+MAX_INFLIGHT="${MAX_INFLIGHT:-32}"
+SEED="${SEED:-42}"
+
+# Parse args
+while [[ $# -gt 0 ]]; do
+    case "$1" in
+        --endpoint) ENDPOINT="$2"; shift 2 ;;
+        --tag) TAG="$2"; shift 2 ;;
+        --target-requests) TARGET_REQUESTS="$2"; shift 2 ;;
+        --time-scale) TIME_SCALE="$2"; shift 2 ;;
+        --max-inflight) MAX_INFLIGHT="$2"; shift 2 ;;
+        *) echo "Unknown arg: $1"; exit 1 ;;
+    esac
+done
+
+SAMPLED_TRACE="traces/sampled_${TARGET_REQUESTS}req_seed${SEED}.jsonl"
+OUTPUT_DIR="outputs/${TAG}_$(date +%Y%m%d_%H%M%S)"
+
+echo "=== Benchmark: tag=$TAG ==="
+echo "  Trace: $TRACE_INPUT"
+echo "  Endpoint: $ENDPOINT"
+echo "  Target requests: $TARGET_REQUESTS"
+echo "  Time scale: $TIME_SCALE"
+echo "  Max inflight sessions: $MAX_INFLIGHT"
+
+# Step 1: Sample trace (if not already done)
+if [ ! -f "$SAMPLED_TRACE" ]; then
+    echo ""
+    echo "=== Step 1: Sampling trace ==="
+    python scripts/sample_trace.py \
+        --input "$TRACE_INPUT" \
+        --output "$SAMPLED_TRACE" \
+        --target-requests "$TARGET_REQUESTS" \
+        --seed "$SEED"
+else
+    echo ""
+    echo "=== Step 1: Using existing sampled trace: $SAMPLED_TRACE ==="
+fi
+
+# Step 2: Run replay
+echo ""
+echo "=== Step 2: Replaying trace ==="
+mkdir -p "$OUTPUT_DIR"
+python -m replayer \
+    --trace "$SAMPLED_TRACE" \
+    --output "$OUTPUT_DIR/metrics.jsonl" \
+    --endpoint "$ENDPOINT" \
+    --time-scale "$TIME_SCALE" \
+    --max-inflight-sessions "$MAX_INFLIGHT" \
+    -v
+
+echo ""
+echo "=== Done ==="
+echo "  Metrics: $OUTPUT_DIR/metrics.jsonl"
+echo "  Summary: $OUTPUT_DIR/metrics.summary.json"
--- a/scripts/run_experiments.sh
+++ b/scripts/run_experiments.sh
@@ -0,0 +1,254 @@
+#!/bin/bash
+# Run the complete experiment matrix:
+#   1. Combined TP=2 DP=4 (4 instances, baseline)
+#   2. Combined TP=1 DP=8 (8 instances, max throughput)
+#   3. PD-Sep TP=1: P×4 + D×4 via Mooncake/RDMA
+#
+# All use the same trace, same concurrency, same timeout.
+
+set -euo pipefail
+
+PROJECT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+VENV="$PROJECT_DIR/.venv/bin"
+VLLM="$VENV/vllm"
+PYTHON="$VENV/python"
+
+MODEL="${MODEL_PATH:-$HOME/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
+TRACE="$PROJECT_DIR/traces/sampled_1000req_seed42.jsonl"
+
+# Uniform benchmark params
+MAX_SESSIONS=${MAX_SESSIONS:-8}
+MAX_CONCURRENT=${MAX_CONCURRENT:-16}
+TIME_SCALE=10
+REQUEST_TIMEOUT=${REQUEST_TIMEOUT:-300}
+REQUEST_LIMIT="${REQUEST_LIMIT:-}"  # empty = all 1000
+
+cleanup_gpu() {
+    pkill -9 -f "vllm" 2>/dev/null || true
+    pkill -9 -f "cache_aware_proxy\|mooncake_connector_proxy\|uvicorn" 2>/dev/null || true
+    fuser 9090/tcp 8000/tcp 2>/dev/null | xargs -r kill -9 2>/dev/null || true
+    sleep 5
+    fuser /dev/nvidia* 2>/dev/null | tr " " "\n" | sort -u | xargs -r kill -9 2>/dev/null || true
+    sleep 10
+}
+
+wait_for_server() {
+    local port=$1
+    local timeout=${2:-600}
+    timeout "$timeout" bash -c "until curl -s localhost:$port/v1/models >/dev/null 2>&1; do sleep 5; done"
+}
+
+run_benchmark() {
+    local tag=$1
+    local endpoint=$2
+    local extra_args="${3:-}"
+    local outdir="$PROJECT_DIR/outputs/$tag"
+
+    echo "  Running benchmark -> $outdir"
+    local limit_arg=""
+    if [ -n "$REQUEST_LIMIT" ]; then
+        limit_arg="--request-limit $REQUEST_LIMIT"
+    fi
+
+    $PYTHON -m replayer \
+        --trace "$TRACE" \
+        --output "$outdir/metrics.jsonl" \
+        --endpoint "$endpoint" \
+        --model "$MODEL" \
+        --time-scale $TIME_SCALE \
+        --max-inflight-sessions $MAX_SESSIONS \
+        --concurrency-limit $MAX_CONCURRENT \
+        --request-timeout $REQUEST_TIMEOUT \
+        $limit_arg \
+        -v
+
+    echo "  Done: $(wc -l < "$outdir/metrics.jsonl") requests"
+}
+
+#######################################################################
+# Experiment 1: Combined TP=2 DP=4
+#######################################################################
+run_combined_tp2_dp4() {
+    echo ""
+    echo "================================================================"
+    echo "  Experiment 1: Combined TP=2 DP=4 (4 instances on 8 GPUs)"
+    echo "================================================================"
+    cleanup_gpu
+
+    for i in 0 1 2 3; do
+        local gpu_start=$((i * 2))
+        local gpu_end=$((gpu_start + 1))
+        local port=$((8000 + i))
+        echo "  Starting instance $i: GPUs $gpu_start,$gpu_end, port $port"
+        CUDA_VISIBLE_DEVICES=$gpu_start,$gpu_end $VLLM serve "$MODEL" \
+            --host 0.0.0.0 --port $port \
+            --tensor-parallel-size 2 \
+            --trust-remote-code --enable-prefix-caching --enforce-eager \
+            --dtype auto --gpu-memory-utilization 0.9 --max-model-len 200000 &
+    done
+
+    for i in 0 1 2 3; do
+        wait_for_server $((8000 + i))
+        echo "    Instance $i ready"
+    done
+    echo "  All 4 instances ready"
+
+    # Start global scheduler (cache-aware proxy in combined mode)
+    echo "  Starting global scheduler..."
+    $PYTHON "$PROJECT_DIR/scripts/cache_aware_proxy.py" \
+        --combined http://127.0.0.1:8000 http://127.0.0.1:8001 http://127.0.0.1:8002 http://127.0.0.1:8003 \
+        --port 9090 &
+    sleep 5
+
+    run_benchmark "exp1_combined_tp2_dp4" "http://localhost:9090"
+}
+
+#######################################################################
+# Experiment 2: Combined TP=1 DP=8
+#######################################################################
+run_combined_tp1_dp8() {
+    echo ""
+    echo "================================================================"
+    echo "  Experiment 2: Combined TP=1 DP=8 (8 instances on 8 GPUs)"
+    echo "================================================================"
+    cleanup_gpu
+
+    for i in $(seq 0 7); do
+        local port=$((8000 + i))
+        echo "  Starting instance $i: GPU $i, port $port"
+        CUDA_VISIBLE_DEVICES=$i $VLLM serve "$MODEL" \
+            --host 0.0.0.0 --port $port \
+            --tensor-parallel-size 1 \
+            --trust-remote-code --enable-prefix-caching --enforce-eager \
+            --dtype auto --gpu-memory-utilization 0.9 --max-model-len 200000 &
+    done
+
+    for i in $(seq 0 7); do
+        wait_for_server $((8000 + i))
+        echo "    Instance $i ready"
+    done
+    echo "  All 8 instances ready"
+
+    # Start global scheduler (cache-aware proxy in combined mode)
+    echo "  Starting global scheduler..."
+    $PYTHON "$PROJECT_DIR/scripts/cache_aware_proxy.py" \
+        --combined http://127.0.0.1:8000 http://127.0.0.1:8001 http://127.0.0.1:8002 http://127.0.0.1:8003 \
+                   http://127.0.0.1:8004 http://127.0.0.1:8005 http://127.0.0.1:8006 http://127.0.0.1:8007 \
+        --port 9090 &
+    sleep 5
+
+    run_benchmark "exp2_combined_tp1_dp8" "http://localhost:9090"
+}
+
+#######################################################################
+# Experiment 3: PD-Sep TP=1 P×4 D×4 (Mooncake/RDMA)
+#######################################################################
+run_pd_sep_tp1() {
+    echo ""
+    echo "================================================================"
+    echo "  Experiment 3: PD-Sep TP=1 P×4 + D×4 (Mooncake/RDMA)"
+    echo "================================================================"
+    cleanup_gpu
+
+    PROXY_SCRIPT="$PROJECT_DIR/scripts/cache_aware_proxy.py"
+
+    # Start 4 prefill instances (GPUs 0-3)
+    local prefill_args=""
+    for i in 0 1 2 3; do
+        local port=$((8010 + i))
+        local bootstrap=$((8998 + i))
+        echo "  Prefill $i: GPU $i, port $port, bootstrap $bootstrap"
+        VLLM_MOONCAKE_BOOTSTRAP_PORT=$bootstrap \
+        CUDA_VISIBLE_DEVICES=$i $VLLM serve "$MODEL" \
+            --host 0.0.0.0 --port $port \
+            --tensor-parallel-size 1 \
+            --trust-remote-code --enable-prefix-caching --enforce-eager \
+            --dtype auto --gpu-memory-utilization 0.9 --max-model-len 200000 \
+            --kv-transfer-config \
+            "{\"kv_connector\":\"MooncakeConnector\",\"kv_role\":\"kv_producer\"}" &
+        prefill_args="$prefill_args --prefill http://127.0.0.1:$port $bootstrap"
+    done
+
+    # Start 4 decode instances (GPUs 4-7)
+    local decode_args=""
+    for i in 0 1 2 3; do
+        local gpu=$((4 + i))
+        local port=$((8020 + i))
+        echo "  Decode $i: GPU $gpu, port $port"
+        CUDA_VISIBLE_DEVICES=$gpu $VLLM serve "$MODEL" \
+            --host 0.0.0.0 --port $port \
+            --tensor-parallel-size 1 \
+            --trust-remote-code --enable-prefix-caching --enforce-eager \
+            --dtype auto --gpu-memory-utilization 0.9 --max-model-len 200000 \
+            --kv-transfer-config \
+            "{\"kv_connector\":\"MooncakeConnector\",\"kv_role\":\"kv_consumer\",\"kv_load_failure_policy\":\"recompute\"}" &
+        decode_args="$decode_args --decode http://127.0.0.1:$port"
+    done
+
+    # Wait for all instances
+    for i in 0 1 2 3; do
+        wait_for_server $((8010 + i))
+        echo "    Prefill $i ready"
+    done
+    for i in 0 1 2 3; do
+        wait_for_server $((8020 + i))
+        echo "    Decode $i ready"
+    done
+
+    # Start proxy (wait for bootstrap to be queryable first)
+    echo "  Waiting for bootstrap servers..."
+    for bp in 8998 8999 9000 9001; do
+        timeout 120 bash -c "until curl -s localhost:$bp/query > /dev/null 2>&1; do sleep 2; done"
+        echo "    Bootstrap $bp ready"
+    done
+
+    echo "  Starting proxy on port 9000..."
+    $PYTHON "$PROXY_SCRIPT" $prefill_args $decode_args --host 0.0.0.0 --port 9090 &
+    sleep 15
+
+    # Smoke test with retry
+    echo "  Smoke test..."
+    for attempt in 1 2 3; do
+        result=$(curl -s -m 120 http://localhost:9090/v1/completions \
+            -X POST -H "Content-Type: application/json" \
+            -d "{\"model\":\"$MODEL\",\"prompt\":[100,200,300],\"max_tokens\":3,\"temperature\":0}" 2>&1)
+        if echo "$result" | grep -q "choices"; then
+            echo "  Smoke test passed!"
+            break
+        fi
+        echo "  Attempt $attempt failed, retrying..."
+        sleep 10
+    done
+
+    run_benchmark "exp3_pd_sep_tp1_mooncake" "http://localhost:9090"
+}
+
+#######################################################################
+# Main
+#######################################################################
+echo "Starting experiment matrix on $(hostname)"
+echo "Model: $MODEL"
+echo "Trace: $TRACE"
+echo "Params: sessions=$MAX_SESSIONS, concurrent=$MAX_CONCURRENT, time_scale=$TIME_SCALE"
+echo ""
+
+case "${1:-all}" in
+    1|tp2dp4)  run_combined_tp2_dp4 ;;
+    2|tp1dp8)  run_combined_tp1_dp8 ;;
+    3|pdsep)   run_pd_sep_tp1 ;;
+    all)
+        run_combined_tp2_dp4
+        run_combined_tp1_dp8
+        run_pd_sep_tp1
+        ;;
+    *)
+        echo "Usage: $0 {1|2|3|all|tp2dp4|tp1dp8|pdsep}"
+        exit 1
+        ;;
+esac
+
+echo ""
+echo "================================================================"
+echo "  All experiments complete!"
+echo "================================================================"
+cleanup_gpu
--- a/scripts/sample_trace.py
+++ b/scripts/sample_trace.py
@@ -0,0 +1,204 @@
+"""Sample sessions from the full cluster-scale trace to fit a single machine.
+
+Preserves:
+  - Complete session structure (all turns within a session kept together)
+  - Original arrival timing (inter-session and intra-session gaps)
+  - hash_ids for KV cache reuse patterns
+  - Request type distribution
+
+Sampling strategy:
+  1. Group requests by session (derived from parent_chat_id chains)
+  2. Randomly sample N sessions (or until target request count reached)
+  3. Re-zero timestamps so first event starts at t=0
+  4. Optionally compress time axis to increase load density
+
+Usage:
+    python scripts/sample_trace.py \\
+        --input ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \\
+        --output traces/sampled.jsonl \\
+        --target-requests 5000 \\
+        --seed 42
+"""
+
+from __future__ import annotations
+
+import argparse
+import collections
+import json
+import random
+import sys
+from pathlib import Path
+
+
+def load_raw_rows(path: Path) -> dict[str, list[dict]]:
+    """Load trace, group rows by resolved session_id. Preserve file order."""
+    chat_to_session: dict[int, str] = {}
+    rows_by_session: dict[str, list[dict]] = collections.OrderedDict()
+
+    with path.open("r", encoding="utf-8") as fh:
+        for line in fh:
+            row = json.loads(line)
+            cid = int(row["chat_id"])
+            pid = int(row["parent_chat_id"])
+
+            if "session_id" in row:
+                sid = str(row["session_id"])
+            elif pid < 0:
+                sid = str(cid)
+            else:
+                sid = chat_to_session.get(pid, str(pid))
+            chat_to_session[cid] = sid
+
+            row["_session_id"] = sid
+            rows_by_session.setdefault(sid, []).append(row)
+
+    return rows_by_session
+
+
+def sample_sessions(
+    rows_by_session: dict[str, list[dict]],
+    *,
+    target_requests: int,
+    seed: int,
+    strategy: str = "random",
+) -> list[str]:
+    """Select sessions until target request count is reached."""
+    all_sids = list(rows_by_session.keys())
+    rng = random.Random(seed)
+
+    if strategy == "random":
+        rng.shuffle(all_sids)
+    elif strategy == "sequential":
+        pass  # keep file order
+    else:
+        raise ValueError(f"Unknown strategy: {strategy}")
+
+    selected = []
+    total = 0
+    for sid in all_sids:
+        selected.append(sid)
+        total += len(rows_by_session[sid])
+        if total >= target_requests:
+            break
+
+    return selected
+
+
+def build_output(
+    rows_by_session: dict[str, list[dict]],
+    selected: list[str],
+    *,
+    time_scale: float = 1.0,
+) -> list[dict]:
+    """Build output rows with re-zeroed timestamps."""
+    out_rows = []
+    for sid in selected:
+        for row in rows_by_session[sid]:
+            out = {k: v for k, v in row.items() if not k.startswith("_")}
+            out["session_id"] = sid
+            out_rows.append(out)
+
+    out_rows.sort(key=lambda r: float(r["timestamp"]))
+
+    if not out_rows:
+        return out_rows
+
+    # Re-zero: subtract earliest timestamp
+    t0 = float(out_rows[0]["timestamp"])
+    for row in out_rows:
+        row["timestamp"] = (float(row["timestamp"]) - t0) / time_scale
+
+    return out_rows
+
+
+def print_summary(
+    rows_by_session: dict[str, list[dict]],
+    selected: list[str],
+    out_rows: list[dict],
+) -> None:
+    n_sessions = len(selected)
+    n_requests = len(out_rows)
+    turns_per_session = [len(rows_by_session[s]) for s in selected]
+    multi_turn = sum(1 for t in turns_per_session if t > 1)
+
+    input_lens = [r["input_length"] for r in out_rows]
+    output_lens = [r["output_length"] for r in out_rows]
+
+    span_s = float(out_rows[-1]["timestamp"]) if out_rows else 0
+    session_starts = {}
+    for r in out_rows:
+        sid = r["session_id"]
+        ts = float(r["timestamp"])
+        if sid not in session_starts:
+            session_starts[sid] = ts
+    starts_sorted = sorted(session_starts.values())
+    deltas = [starts_sorted[i+1] - starts_sorted[i]
+              for i in range(len(starts_sorted) - 1)]
+
+    # hash_ids overlap: count unique hash_ids across all requests
+    all_hashes = set()
+    for r in out_rows:
+        all_hashes.update(r.get("hash_ids", []))
+
+    print(f"Sampled: {n_sessions} sessions, {n_requests} requests")
+    print(f"  Multi-turn sessions: {multi_turn} ({multi_turn/n_sessions*100:.1f}%)")
+    print(f"  Turns/session: min={min(turns_per_session)} max={max(turns_per_session)} "
+          f"avg={sum(turns_per_session)/len(turns_per_session):.1f}")
+    print(f"  Input length: min={min(input_lens)} max={max(input_lens)} "
+          f"avg={sum(input_lens)/len(input_lens):.0f}")
+    print(f"  Output length: min={min(output_lens)} max={max(output_lens)} "
+          f"avg={sum(output_lens)/len(output_lens):.0f}")
+    print(f"  Trace span: {span_s:.1f}s ({span_s/60:.1f} min)")
+    print(f"  Unique hash blocks: {len(all_hashes)}")
+    if deltas:
+        deltas.sort()
+        p = lambda q: deltas[min(int(q * len(deltas)), len(deltas) - 1)]
+        print(f"  Session arrival deltas (s): p10={p(0.1):.2f} p50={p(0.5):.2f} "
+              f"p90={p(0.9):.2f} max={max(deltas):.2f}")
+
+
+def main() -> None:
+    p = argparse.ArgumentParser(description=__doc__,
+                                formatter_class=argparse.RawDescriptionHelpFormatter)
+    p.add_argument("--input", type=Path, required=True,
+                   help="Path to the full trace JSONL file")
+    p.add_argument("--output", type=Path, required=True,
+                   help="Path to write sampled trace JSONL")
+    p.add_argument("--target-requests", type=int, default=5000,
+                   help="Target number of requests (stops after session that crosses it)")
+    p.add_argument("--strategy", choices=["random", "sequential"], default="random",
+                   help="Session selection strategy")
+    p.add_argument("--time-scale", type=float, default=1.0,
+                   help="Compress time axis by this factor (>1 = faster arrival)")
+    p.add_argument("--seed", type=int, default=42)
+    args = p.parse_args()
+
+    print(f"Loading trace from {args.input} ...")
+    rows_by_session = load_raw_rows(args.input)
+    total_sessions = len(rows_by_session)
+    total_requests = sum(len(v) for v in rows_by_session.values())
+    print(f"Full trace: {total_sessions} sessions, {total_requests} requests")
+
+    selected = sample_sessions(
+        rows_by_session,
+        target_requests=args.target_requests,
+        seed=args.seed,
+        strategy=args.strategy,
+    )
+
+    out_rows = build_output(
+        rows_by_session, selected,
+        time_scale=args.time_scale,
+    )
+
+    print_summary(rows_by_session, selected, out_rows)
+
+    args.output.parent.mkdir(parents=True, exist_ok=True)
+    with args.output.open("w", encoding="utf-8") as fh:
+        for row in out_rows:
+            fh.write(json.dumps(row, ensure_ascii=False) + "\n")
+    print(f"\nWrote {len(out_rows)} rows to {args.output}")
+
+
+if __name__ == "__main__":
+    main()