# PD Transfer Lifecycle Breakdown Microbenchmark

## Goal

Profile the **complete request lifecycle** under prefill/decode (PD) disaggregation, with emphasis on the P→D KV transfer stage. Produce a per-phase latency breakdown as a function of three independent variables:

```
breakdown(prior_context, current_new_tokens, output_length) → {
    routing_ms,
    p_queue_ms,
    p_prefill_ms,
    zmq_handshake_ms,
    rdma_transfer_ms,
    transfer_completion_signal_ms,
    d_block_alloc_ms,
    d_cache_promotion_ms,
    d_schedule_ms,
    d_first_decode_ms,
    d_decode_total_ms
}
```

---

## Background: vLLM PD Transfer Semantics

Transfer is **incremental**:

- D uses its local prefix cache for prior turns (blocks with matching hashes)
- P only transfers the **delta**: `ext_tokens = remote_total - D_local_cache_hits`
- D combines: local prefix cache + remote-transferred blocks + locally-computed remainder

Therefore, for a given total input length, `prior_context` (already cached on D) determines how much P actually transfers.

---

## Hardware & Model

| Parameter | Value |
|-----------|-------|
| GPUs | 2× H20 96GB (1 for P, 1 for D), NVLink/RDMA connected |
| Model | Qwen3-Coder-30B-A3B-Instruct |
| TP | 1 per instance |
| Transfer | Mooncake (`kv_producer` / `kv_consumer`) |
| `enable_prefix_caching` | true |
| `enable_chunked_prefill` | true |
| `max_num_batched_tokens` | 8192 |
| `gpu_memory_utilization` | 0.9 |

---

## Independent Variables

| Variable | Symbol | Values | Meaning |
|----------|--------|--------|---------|
| Prior context (D-side cached) | `C` | 0, 4k, 16k, 32k, 64k, 100k | Tokens from prior turns, already in D's prefix cache |
| Current new tokens | `N` | 512, 2k, 4k, 8k, 16k, 32k | Tokens P must prefill and transfer (the delta) |
| Output length | `O` | 1, 32, 128, 512 | Decode tokens D generates after receiving KV |

Sweep: 6 × 6 × 4 = 144 configurations.

**Total input_length per request** = `C + N`.
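The sweep above can be enumerated programmatically. The following is a minimal sketch, assuming `C` values are decimal thousands and `N` values are powers of two (this matches the example per-request record, where 32000 + 8192 = 40192). The names `sweep` and `expected_transfer_tokens` are hypothetical helpers for this benchmark, not part of vLLM or Mooncake.

```python
from itertools import product

# Assumption: "k" means 1000 for C and 1024 for N, as in the example record.
PRIOR_CONTEXT = [0, 4_000, 16_000, 32_000, 64_000, 100_000]  # C
NEW_TOKENS = [512, 2_048, 4_096, 8_192, 16_384, 32_768]      # N
OUTPUT_LENGTH = [1, 32, 128, 512]                            # O


def sweep():
    """Yield one config dict per (C, N, O) combination."""
    for c, n, o in product(PRIOR_CONTEXT, NEW_TOKENS, OUTPUT_LENGTH):
        yield {
            "prior_context": c,
            "current_new_tokens": n,
            "output_length": o,
            "total_input_length": c + n,
        }


def expected_transfer_tokens(remote_total, d_local_cache_hits):
    """Delta that P must transfer: ext_tokens = remote_total - D_local_cache_hits."""
    return remote_total - d_local_cache_hits


configs = list(sweep())
print(f"{len(configs)} configurations")
```

If D's cache holds exactly the `C` prior tokens, `expected_transfer_tokens(C + N, C)` reduces to `N`, which is the quantity the sweep treats as the transferred delta.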
---

## Lifecycle Phases & Instrumentation

### Phase Diagram

```
Time ─────────────────────────────────────────────────────────────────────►

[Routing] [P Queue] [P Prefill (chunked)] [Transfer]      [D Startup] [D Decode]
t0        t1        t2                    t3  t4  t5  t6  t7  t8      t9        t10

t0:  Request arrives at proxy/router
t1:  Request dispatched to P instance (P receives HTTP request)
t2:  P scheduler picks up request (first prefill chunk starts)
t3:  P prefill completes (last chunk done, all KV blocks ready)
t4:  P sends ZMQ metadata to D (or D sends block alloc to P)
t5:  First RDMA write issued
t6:  Last RDMA write completes (all blocks landed on D GPU)
t7:  D receives completion signal (ZMQ response parsed)
t8:  D scheduler promotes request from WAITING_FOR_REMOTE_KVS → schedulable
t9:  D first decode token emitted
t10: D final output token emitted
```

### Instrumentation Points

| Timestamp | Where to instrument | Method |
|-----------|-------------------|--------|
| `t0` | Proxy `pick_instance()` entry | Proxy log with `time.perf_counter_ns()` |
| `t1` | Proxy forwards to P (HTTP send complete) | Proxy log |
| `t2` | P scheduler `schedule()`: request leaves WAITING | vLLM patch: log in scheduler |
| `t3` | P `request_finished()` or `save_kv_layer` last layer | vLLM patch: log in connector `record_send_reqs` |
| `t4` | P `send_kv_to_decode`: ZMQ metadata received by handler | Connector log: before `_build_transfer_params` |
| `t5` | P `batch_transfer_sync_write` entry | Connector log |
| `t6` | P `batch_transfer_sync_write` return | Connector log (ret_value == 0) |
| `t7` | D `process_pulling_result`: `finished_recving_reqs.add()` | Connector log |
| `t8` | D scheduler `_try_promote_blocked_waiting_request` success | Scheduler log |
| `t9` | D first token streamed to client | Client-side SSE timestamp |
| `t10` | D last token streamed to client | Client-side SSE timestamp |

### Derived Metrics

| Metric | Formula | What it tells us |
|--------|---------|-----------------|
| `routing_latency` | t1 - t0 | Proxy overhead |
| `p_queue_time` | t2 - t1 | P scheduling delay |
| `p_prefill_time` | t3 - t2 | Actual prefill compute (chunked) |
| `zmq_handshake` | t5 - t3 | ZMQ coordination overhead (P ready → RDMA starts) |
| `rdma_transfer_time` | t6 - t5 | Pure RDMA data movement |
| `transfer_signal_latency` | t7 - t6 | Completion detection (ZMQ response + asyncio poll) |
| `d_promotion_latency` | t8 - t7 | Scheduler step delay until promotion |
| `d_first_token_latency` | t9 - t8 | D compute startup (1 token forward + sampling) |
| `d_decode_time` | t10 - t9 | Decode generation (O-1 tokens) |
| **`transfer_total`** | t7 - t3 | **End-to-end transfer overhead** (the key number) |
| **`ttft_overhead_vs_colo`** | t9 - t0 - p_prefill_time | Estimated extra latency vs the same request on a combined instance (takes `p_prefill_time` as the co-located TTFT) |

---

## Transfer Internal Breakdown

Within the transfer stage, instrument `rdma_transfer_time` and its immediate setup further:

| Sub-phase | How to measure |
|-----------|---------------|
| `build_transfer_params` | Time `_build_transfer_params()` call |
| `rdma_write_submit` | Time from `batch_transfer_sync_write` entry to first RDMA CQ completion (if available) |
| `rdma_write_total` | Full `batch_transfer_sync_write` duration |
| `bytes_transferred` | `sum(lengths)` from transfer params |
| `num_rdma_ops` | `len(src_ptrs)` (number of RDMA write operations) |
| `effective_bandwidth` | `bytes_transferred / rdma_write_total` |
| `num_layers_transferred` | Count of unique layers in transfer |
| `num_blocks_transferred` | Count of blocks |

Expected relationships:

- `bytes_transferred = num_blocks × block_size_bytes × num_layers`
- `block_size_bytes = 16 tokens × 2(KV) × num_kv_heads × head_dim × dtype_size`
- `rdma_transfer_time ≈ bytes_transferred / RDMA_bandwidth + per_op_latency × num_ops`

---

## Protocol

### Setup: Warm D's Prefix Cache

To control `prior_context` (C), D must hold the prior-turn KV in its local prefix cache:

```
Phase 0: Seed D's cache
  1. For each config with C > 0:
     - Send a request with C-token prompt directly to D (combined mode, no PD-sep)
     - Let it generate 1 token → D now has C tokens in prefix cache
     - Verify via /metrics that prefix cache utilization increased
  2. Switch D to kv_consumer mode (or keep combined + use kv_transfer_params override)
```

Alternative: Use D in `kv_both` mode (combined + Mooncake enabled), then send PD-sep requests with `kv_transfer_params` that explicitly request remote prefill.

### Main Experiment

```
For C in [0, 4k, 16k, 32k, 64k, 100k]:
    Seed D's prefix cache with C tokens (Phase 0)
    For N in [512, 2k, 4k, 8k, 16k, 32k]:
        For O in [1, 32, 128, 512]:
            Construct request:
                input = C_token_prefix + N_random_new_tokens (total = C+N)
                max_tokens = O
                kv_transfer_params = {do_remote_prefill: true, ...}
            Send request through proxy → P → D
            Collect all timestamps (t0..t10)
            Repeat 5 times (fresh random N tokens per repetition, so neither P nor D has them cached)
            Record breakdown
    Evict D's cache (restart or send cache-clearing requests)
```

### D-Side Cache Verification

Before each measurement, verify D's cache state:

```python
import httpx

# Address of the D instance (placeholder values; adjust to your deployment)
d_host, d_port = "localhost", 8000

# Check that D has exactly C tokens cached
resp = httpx.get(f"http://{d_host}:{d_port}/metrics")
resp.raise_for_status()
# Parse vllm:prefix_cache_hit_rate or gpu_prefix_cache_hit_rate
# Or use internal API to query cached block count
```

---

## vLLM Instrumentation Patch

Minimal patch to `mooncake_connector.py` for timestamp collection:

```python
# Add at top
import time

_PROFILE_LOG = []  # or write to file

# In send_kv_to_decode(), around line 800-990:
async def send_kv_to_decode(self, ...):
    t_ready = time.perf_counter_ns()  # P prefill done, ready to send

    # ... ZMQ receive metadata from D ...
    t_zmq_recv = time.perf_counter_ns()  # t4

    # ... build transfer params ...
    t_params_built = time.perf_counter_ns()  # t5: just before the RDMA write

    ret_value = self.engine.batch_transfer_sync_write(...)
    t_rdma_done = time.perf_counter_ns()  # t6

    # ... send ZMQ response ...
    t_zmq_sent = time.perf_counter_ns()

    # req_id, lengths, and src_ptrs come from the surrounding connector code
    _PROFILE_LOG.append({
        "req_id": req_id,
        "bytes": sum(lengths),
        "num_ops": len(src_ptrs),
        "t_ready": t_ready,
        "t_zmq_recv": t_zmq_recv,
        "t_params_built": t_params_built,
        "t_rdma_done": t_rdma_done,
        "t_zmq_sent": t_zmq_sent,
    })
```

Similar patches needed in:

- `scheduler.py`: Log `t_schedule_start`, `t_promote`
- `process_pulling_result()`: Log `t_recv_complete`

---

## Output Format

All values in the examples below are illustrative placeholders that show the schema, not measurements. In real records, `rdma_write_ms` should agree with `rdma_transfer`, and `effective_bw_gbps` should equal `bytes_transferred × 8 / rdma_write_total`.

### Per-Request Record (`results/lifecycle/C{c}_N{n}_O{o}_rep{r}.json`)

```json
{
  "config": {
    "prior_context": 32000,
    "current_new_tokens": 8192,
    "output_length": 128,
    "total_input_length": 40192
  },
  "timestamps_ns": {
    "t0_proxy_recv": 1000000000,
    "t1_proxy_dispatch": 1000050000,
    "t2_p_schedule": 1000200000,
    "t3_p_prefill_done": 1001100000,
    "t4_zmq_metadata": 1001150000,
    "t5_rdma_start": 1001200000,
    "t6_rdma_complete": 1002300000,
    "t7_d_recv_signal": 1002350000,
    "t8_d_promoted": 1002500000,
    "t9_d_first_token": 1002600000,
    "t10_d_last_token": 1003800000
  },
  "breakdown_ms": {
    "routing": 0.05,
    "p_queue": 0.15,
    "p_prefill": 0.90,
    "zmq_handshake": 0.10,
    "rdma_transfer": 1.10,
    "transfer_signal": 0.05,
    "d_promotion": 0.15,
    "d_first_token": 0.10,
    "d_decode": 1.20,
    "transfer_total": 1.25,
    "e2e": 3.80
  },
  "transfer_detail": {
    "bytes_transferred": 268435456,
    "num_rdma_ops": 512,
    "num_blocks": 512,
    "num_layers": 32,
    "build_params_ms": 0.8,
    "rdma_write_ms": 1100.0,
    "effective_bw_gbps": 195.2
  }
}
```

### Aggregated Summary (`results/lifecycle/summary.csv`)

```csv
prior_context,new_tokens,output_length,p_prefill_ms,rdma_transfer_ms,transfer_total_ms,d_decode_ms,e2e_ms,bytes_GB,bw_gbps,ttft_overhead_ms
0,8192,128,890,1100,1200,480,2620,0.268,195,1200
32000,8192,128,890,1100,1200,480,2620,0.268,195,1200
64000,8192,128,890,1100,1200,480,2620,0.268,195,1200
0,32768,128,3200,4400,4500,480,8230,1.073,195,4500
```

---

## Analysis Deliverables

### 1. Stacked Bar Chart: Lifecycle Breakdown vs N (new tokens)

- X-axis: `current_new_tokens`
- Y-axis: Time (ms)
- Stacked bars: routing | p_queue | p_prefill | zmq | rdma_transfer | signal | d_promotion | d_first_token | d_decode

Separate subplot rows for each `prior_context` value.

### 2. Transfer Bandwidth Characterization

Plot `effective_bandwidth` vs `bytes_transferred`:

- Expected: bandwidth increases with transfer size (amortizes per-op latency)
- Identify the "bandwidth knee": the minimum transfer size for near-peak bandwidth
- Compare against theoretical 200 Gbps RDMA limit

### 3. Transfer Cost Model

Fit: `rdma_transfer_ms = α + β × bytes_transferred`

- α = fixed per-transfer cost (ZMQ + scheduling)
- β = 1/bandwidth (bytes → time)

### 4. Overhead vs Co-Located Baseline

For each config, also measure the same request on a **combined** (no PD-sep) instance:

- `colo_ttft` = time from request to first token on combined instance
- `pdsep_overhead = pdsep_ttft - colo_ttft`

Plot: overhead as a function of (C, N). At what point does PD-sep become net-negative?

### 5. Impact of Prior Context on Transfer Volume

Since transfer is incremental:

- When `C` increases (D has more cached), actual `bytes_transferred` should stay constant (≈ N × per_token_kv_size)
- Verify this. If it is NOT constant, there is a bug in the incremental transfer logic
- Plot actual `bytes_transferred` vs C for fixed N

---

## Risks & Mitigations

| Risk | Impact | Mitigation |
|------|--------|------------|
| Clock skew between P and D processes | Wrong cross-instance durations | Use single-machine 2-GPU setup so all processes read the same monotonic clock via `time.perf_counter_ns()` |
| P and D scheduler step async | Promotion delayed by step interval | Record D's scheduler step frequency, subtract 0.5×step from d_promotion |
| Prefix cache eviction during experiment | C not actually cached | Monitor cache metrics, use small enough working set |
| Mooncake connection pool warmup | First transfer slower | Discard first 2 repetitions, use reps 3-5 |
| vLLM internal queuing at high C+N | OOM or scheduling delays | Monitor `gpu_cache_usage_perc`, keep C+N ≤ 132k |

---

## Execution Estimate

| Phase | Time |
|-------|------|
| vLLM patch development & validation | 2 hours |
| Per configuration (5 reps × ~10s each) | ~50s |
| Full sweep (144 configs × 50s) | ~2 hours |
| Cache seeding overhead (6 prior_context levels) | ~30 min |
| Analysis & plotting | 2 hours |
| **Total** | **~7 hours** |

---

## Success Criteria

1. **Breakdown is complete**: All phases sum to E2E (residual < 5%)
2. **Transfer dominates**: `rdma_transfer_ms > p_prefill_ms` for N ≥ 4k (confirms current bottleneck hypothesis)
3. **Bandwidth model fits**: Linear model R² > 0.95
4. **Incremental verified**: `bytes_transferred` independent of `prior_context` for fixed N
5. **Overhead quantified**: Clear threshold (N, C) where PD-sep overhead exceeds co-located execution
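
Success criterion 1 can be checked mechanically from a per-request record. The sketch below derives the breakdown from the `timestamps_ns` field using the formulas in the Derived Metrics table and computes the residual against E2E. The helper names (`breakdown_ms`, `residual_fraction`) are hypothetical, and the timestamps are the illustrative values from the example record, not measurements.

```python
NS_PER_MS = 1_000_000

# Phase name -> (start, end) timestamp labels, per the Derived Metrics table.
PHASES = {
    "routing": ("t0", "t1"),
    "p_queue": ("t1", "t2"),
    "p_prefill": ("t2", "t3"),
    "zmq_handshake": ("t3", "t5"),
    "rdma_transfer": ("t5", "t6"),
    "transfer_signal": ("t6", "t7"),
    "d_promotion": ("t7", "t8"),
    "d_first_token": ("t8", "t9"),
    "d_decode": ("t9", "t10"),
}


def breakdown_ms(timestamps_ns):
    """Convert a timestamps_ns dict (keys like 't3_p_prefill_done') to per-phase ms."""
    # Keep only the 'tN' prefix of each key so lookups match the table's labels.
    ts = {key.split("_", 1)[0]: value for key, value in timestamps_ns.items()}
    out = {name: (ts[end] - ts[start]) / NS_PER_MS
           for name, (start, end) in PHASES.items()}
    out["transfer_total"] = (ts["t7"] - ts["t3"]) / NS_PER_MS
    out["e2e"] = (ts["t10"] - ts["t0"]) / NS_PER_MS
    return out


def residual_fraction(bd):
    """Fraction of E2E time not explained by the sum of the phases."""
    phase_sum = sum(bd[name] for name in PHASES)
    return abs(bd["e2e"] - phase_sum) / bd["e2e"]


# Illustrative timestamps copied from the example per-request record.
example_ts = {
    "t0_proxy_recv": 1000000000, "t1_proxy_dispatch": 1000050000,
    "t2_p_schedule": 1000200000, "t3_p_prefill_done": 1001100000,
    "t4_zmq_metadata": 1001150000, "t5_rdma_start": 1001200000,
    "t6_rdma_complete": 1002300000, "t7_d_recv_signal": 1002350000,
    "t8_d_promoted": 1002500000, "t9_d_first_token": 1002600000,
    "t10_d_last_token": 1003800000,
}

bd = breakdown_ms(example_ts)
print(f"residual: {residual_fraction(bd):.2%}")
```

When every phase is derived from contiguous timestamps on one clock, the residual is zero by construction. The check becomes meaningful once per-phase durations are logged independently by the proxy, P, and D.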
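
For the transfer cost model (deliverable 3, success criterion 3), an ordinary least-squares fit is enough. The following is a dependency-free sketch under the assumption that each sample is one request's `bytes_transferred` and `rdma_transfer` time in milliseconds. `fit_linear` and `implied_bandwidth_gbps` are hypothetical helper names, and a library routine such as a NumPy or SciPy regression would serve equally well.

```python
def fit_linear(xs, ys):
    """Least-squares fit of y = alpha + beta * x; returns (alpha, beta, r_squared)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) samples of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    ss_res = sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    # A constant y is fit perfectly by a flat line, so report R^2 = 1 in that case.
    r_squared = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return alpha, beta, r_squared


def implied_bandwidth_gbps(beta_ms_per_byte):
    """Convert the slope (ms per byte) to bandwidth in gigabits per second."""
    bytes_per_second = 1.0 / (beta_ms_per_byte * 1e-3)
    return bytes_per_second * 8 / 1e9
```

Pass the per-request `bytes_transferred` values as `xs` and the matching `rdma_transfer` times as `ys`, then compare `r_squared` against the 0.95 threshold and `implied_bandwidth_gbps(beta)` against the 200 Gbps theoretical limit.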