# PD Transfer Lifecycle Breakdown Microbenchmark

## Goal

Profile the **complete request lifecycle** under PD disaggregation, with emphasis on the P→D KV transfer stage. Produce a per-phase latency breakdown as a function of three independent variables:

```
breakdown(prior_context, current_new_tokens, output_length) → {
    routing_ms, p_queue_ms, p_prefill_ms, 
    zmq_handshake_ms, rdma_transfer_ms, transfer_completion_signal_ms,
    d_block_alloc_ms, d_cache_promotion_ms, d_schedule_ms, d_first_decode_ms,
    d_decode_total_ms
}
```

---

## Background: vLLM PD Transfer Semantics

Transfer is **incremental**:
- D uses its local prefix cache for prior turns (blocks with matching hashes)
- P only transfers the **delta**: `ext_tokens = remote_total - D_local_cache_hits`
- D combines: local prefix cache + remote-transferred blocks + locally-computed remainder

Therefore, `prior_context` (already cached on D) determines how much P actually transfers.
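
As a rough illustration of the incremental rule above, the sketch below estimates how many tokens P would transfer for one request. The function name is hypothetical, and the assumption that D only reuses whole 16-token blocks is ours, not something confirmed against the vLLM source.

```python
def estimate_ext_tokens(remote_total: int, d_cached_tokens: int, block_size: int = 16) -> int:
    """Estimate tokens P must transfer: remote_total - D_local_cache_hits.

    Assumption: D only reuses fully populated blocks, so local hits are
    rounded down to a multiple of block_size.
    """
    if remote_total < 0 or d_cached_tokens < 0:
        raise ValueError("token counts must be non-negative")
    local_hits = min(d_cached_tokens, remote_total) // block_size * block_size
    return remote_total - local_hits


# C = 32000 tokens cached on D, N = 8192 new tokens: only the delta is sent.
print(estimate_ext_tokens(remote_total=32000 + 8192, d_cached_tokens=32000))  # 8192
```

If the cached prefix does not end on a block boundary, the estimate grows by the partial block, which is one reason measured transfer sizes may differ slightly from `N`.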

---

## Hardware & Model

| Parameter | Value |
|-----------|-------|
| GPUs | 2× H20 96GB (1 for P, 1 for D), NVLink/RDMA connected |
| Model | Qwen3-Coder-30B-A3B-Instruct |
| TP | 1 per instance |
| Transfer | Mooncake (`kv_producer` / `kv_consumer`) |
| `enable_prefix_caching` | true |
| `enable_chunked_prefill` | true |
| `max_num_batched_tokens` | 8192 |
| `gpu_memory_utilization` | 0.9 |

---

## Independent Variables

| Variable | Symbol | Values | Meaning |
|----------|--------|--------|---------|
| Prior context (D-side cached) | `C` | 0, 4k, 16k, 32k, 64k, 100k | Tokens from prior turns, already in D's prefix cache |
| Current new tokens | `N` | 512, 2k, 4k, 8k, 16k, 32k | Tokens P must prefill and transfer (the delta) |
| Output length | `O` | 1, 32, 128, 512 | Decode tokens D generates after receiving KV |

Sweep: 6 × 6 × 4 = 144 configurations.

**Total input_length per request** = `C + N`.
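
A minimal sketch of enumerating the sweep, assuming "k" means ×1000 for prior context and ×1024 for new tokens. That convention is inferred from the sample records later in this document (`32000` and `8192`) and should be adjusted if the intended values differ.

```python
from itertools import product

# Assumption: "k" is x1000 for C and x1024 for N (inferred from the sample records).
PRIOR_CONTEXT = [0, 4_000, 16_000, 32_000, 64_000, 100_000]  # C
NEW_TOKENS = [512, 2_048, 4_096, 8_192, 16_384, 32_768]      # N
OUTPUT_LENGTH = [1, 32, 128, 512]                             # O


def sweep_configs():
    """Yield one dict per (C, N, O) point, with C changing slowest."""
    for c, n, o in product(PRIOR_CONTEXT, NEW_TOKENS, OUTPUT_LENGTH):
        yield {
            "prior_context": c,
            "current_new_tokens": n,
            "output_length": o,
            "total_input_length": c + n,
        }


configs = list(sweep_configs())
print(len(configs))  # 144
```

Iterating with C outermost matches the protocol below, where D's cache is seeded once per C level.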

---

## Lifecycle Phases & Instrumentation

### Phase Diagram

```
Time ─────────────────────────────────────────────────────────────────────►

t0        t1        t2                    t3   t4   t5   t6   t7      t8      t9         t10
├─Routing─┼─P Queue─┼─P Prefill (chunked)─┼──────Transfer─────┼───D Startup───┼─D Decode─┤

t0: Request arrives at proxy/router
t1: Request dispatched to P instance (P receives HTTP request)
t2: P scheduler picks up request (first prefill chunk starts)
t3: P prefill completes (last chunk done, all KV blocks ready)
t4: P's handler receives ZMQ metadata from D (D's block allocation)
t5: First RDMA write issued
t6: Last RDMA write completes (all blocks landed on D GPU)
t7: D receives completion signal (ZMQ response parsed)
t8: D scheduler promotes request from WAITING_FOR_REMOTE_KVS → schedulable
t9: D first decode token emitted
t10: D final output token emitted
```

### Instrumentation Points

| Timestamp | Where to instrument | Method |
|-----------|-------------------|--------|
| `t0` | Proxy `pick_instance()` entry | Proxy log with `time.perf_counter_ns()` |
| `t1` | Proxy forwards to P (HTTP send complete) | Proxy log |
| `t2` | P scheduler `schedule()` — request leaves WAITING | vLLM patch: log in scheduler |
| `t3` | P `request_finished()` or `save_kv_layer` last layer | vLLM patch: log in connector `record_send_reqs` |
| `t4` | P `send_kv_to_decode`: ZMQ metadata received by handler | Connector log: before `_build_transfer_params` |
| `t5` | P `batch_transfer_sync_write` entry | Connector log |
| `t6` | P `batch_transfer_sync_write` return | Connector log (ret_value == 0) |
| `t7` | D `process_pulling_result`: `finished_recving_reqs.add()` | Connector log |
| `t8` | D scheduler `_try_promote_blocked_waiting_request` success | Scheduler log |
| `t9` | D first token streamed to client | Client-side SSE timestamp |
| `t10` | D last token streamed to client | Client-side SSE timestamp |

### Derived Metrics

| Metric | Formula | What it tells us |
|--------|---------|-----------------|
| `routing_latency` | t1 - t0 | Proxy overhead |
| `p_queue_time` | t2 - t1 | P scheduling delay |
| `p_prefill_time` | t3 - t2 | Actual prefill compute (chunked) |
| `zmq_handshake` | t5 - t3 | ZMQ coordination overhead (P ready → RDMA starts) |
| `rdma_transfer_time` | t6 - t5 | Pure RDMA data movement |
| `transfer_signal_latency` | t7 - t6 | Completion detection (ZMQ response + asyncio poll) |
| `d_promotion_latency` | t8 - t7 | Scheduler step delay until promotion |
| `d_first_token_latency` | t9 - t8 | D compute startup (1 token forward + sampling) |
| `d_decode_time` | t10 - t9 | Decode generation (O-1 tokens) |
| **`transfer_total`** | t7 - t3 | **End-to-end transfer overhead** (the key number) |
| **`ttft_overhead_vs_colo`** | t9 - t0 - p_prefill_time | Estimated extra TTFT versus running the same request on a combined instance (approximates co-located TTFT by `p_prefill_time`) |
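
The formulas in the table can be applied mechanically once all eleven timestamps are collected. The sketch below assumes the timestamps arrive as a dict keyed `"t0"` to `"t10"` in nanoseconds; the key names and the function name are illustrative.

```python
NS_PER_MS = 1_000_000

# (metric name, end timestamp, start timestamp), following the table above.
PHASES = [
    ("routing", "t1", "t0"),
    ("p_queue", "t2", "t1"),
    ("p_prefill", "t3", "t2"),
    ("zmq_handshake", "t5", "t3"),
    ("rdma_transfer", "t6", "t5"),
    ("transfer_signal", "t7", "t6"),
    ("d_promotion", "t8", "t7"),
    ("d_first_token", "t9", "t8"),
    ("d_decode", "t10", "t9"),
]


def derive_breakdown_ms(ts: dict) -> dict:
    """Convert raw nanosecond timestamps t0..t10 into per-phase milliseconds."""
    out = {}
    for name, end, start in PHASES:
        delta = ts[end] - ts[start]
        if delta < 0:
            raise ValueError(f"{name}: {end} precedes {start}; check the clock source")
        out[name] = delta / NS_PER_MS
    out["transfer_total"] = (ts["t7"] - ts["t3"]) / NS_PER_MS
    out["e2e"] = (ts["t10"] - ts["t0"]) / NS_PER_MS
    out["ttft_overhead_vs_colo"] = (ts["t9"] - ts["t0"]) / NS_PER_MS - out["p_prefill"]
    return out
```

Because the nine phases are contiguous, their sum should equal `e2e` exactly. A non-zero residual points to a missing or misordered timestamp rather than measurement noise.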

---

## Transfer Internal Breakdown

For the transfer stage (t4 → t6, spanning parameter building and the RDMA write), instrument further:

| Sub-phase | How to measure |
|-----------|---------------|
| `build_transfer_params` | Time `_build_transfer_params()` call |
| `rdma_write_submit` | Time from `batch_transfer_sync_write` entry to first RDMA CQ completion (if available) |
| `rdma_write_total` | Full `batch_transfer_sync_write` duration |
| `bytes_transferred` | `sum(lengths)` from transfer params |
| `num_rdma_ops` | `len(src_ptrs)` (number of RDMA write operations) |
| `effective_bandwidth` | `bytes_transferred × 8 / rdma_write_total`, reported in Gbps |
| `num_layers_transferred` | Count of unique layers in transfer |
| `num_blocks_transferred` | Count of blocks |

Expected relationships:
- `bytes_transferred = num_blocks × block_size_bytes × num_layers`
- `block_size_bytes = 16 tokens × 2 (K and V) × num_kv_heads × head_dim × dtype_size` (per layer, assuming the default 16-token block)
- `rdma_transfer_time ≈ bytes_transferred / RDMA_bandwidth + per_op_latency × num_ops`
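
The relationships above can be turned into a quick sanity check against measured values. In this sketch the model dimensions in the example call (4 KV heads, head dimension 128, 2-byte dtype, 48 layers) are placeholder assumptions; read the real values from the model's configuration before relying on the numbers.

```python
def block_size_bytes(num_kv_heads: int, head_dim: int, dtype_size: int,
                     tokens_per_block: int = 16) -> int:
    """Per-layer bytes of one KV block: tokens x 2 (K and V) x heads x head_dim x dtype."""
    return tokens_per_block * 2 * num_kv_heads * head_dim * dtype_size


def expected_bytes(num_blocks: int, num_layers: int, **block_kwargs) -> int:
    """bytes_transferred = num_blocks x block_size_bytes x num_layers."""
    return num_blocks * block_size_bytes(**block_kwargs) * num_layers


def predicted_transfer_ms(num_bytes: int, num_ops: int,
                          bandwidth_gbps: float, per_op_latency_us: float) -> float:
    """rdma_transfer_time ~= bytes / bandwidth + per_op_latency x num_ops."""
    wire_ms = num_bytes * 8 / (bandwidth_gbps * 1e9) * 1e3
    return wire_ms + per_op_latency_us * num_ops / 1e3


# Placeholder dimensions (assumed, not verified): N = 8192 tokens -> 512 blocks.
total = expected_bytes(512, 48, num_kv_heads=4, head_dim=128, dtype_size=2)
print(total)  # 805306368
```

Comparing `expected_bytes` with the logged `sum(lengths)` is a cheap way to catch a wrong block count or a missed layer before fitting any bandwidth model.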

---

## Protocol

### Setup: Warm D's Prefix Cache

To control `prior_context` (C), we need D to have prior-turn KV in its local prefix cache:

```
Phase 0: Seed D's cache
  1. For each config with C > 0:
     - Send a request with C-token prompt directly to D (combined mode, no PD-sep)
     - Let it generate 1 token → D now has C tokens in prefix cache
     - Verify via /metrics that prefix cache utilization increased
  2. Switch D to kv_consumer mode (or keep combined + use kv_transfer_params override)
```

Alternative: Use D in `kv_both` mode (combined + Mooncake enabled), then send PD-sep requests with `kv_transfer_params` that explicitly request remote prefill.

### Main Experiment

```
For C in [0, 4k, 16k, 32k, 64k, 100k]:
    Seed D's prefix cache with C tokens (Phase 0)
    
    For N in [512, 2k, 4k, 8k, 16k, 32k]:
        For O in [1, 32, 128, 512]:
            Construct request:
                input = C_token_prefix + N_random_new_tokens  (total = C+N)
                max_tokens = O
                kv_transfer_params = {do_remote_prefill: true, ...}
            
            Send request through proxy → P → D
            Collect all timestamps (t0..t10)
            Repeat 5 times (fresh random N tokens per repetition, so no rep reuses cached KV for the delta)
            
            Record breakdown
    
    Evict D's cache (restart or send cache-clearing requests)
```
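
One way to build the request bodies for this loop is sketched below. It assumes the serving endpoint accepts a list of token ids as `prompt`, uses an arbitrary `VOCAB_SIZE`, and fills in only the `do_remote_prefill` field named above; the remaining `kv_transfer_params` fields are left out because they depend on the deployment. All helper names are hypothetical.

```python
import random

VOCAB_SIZE = 32_000  # assumption: any range of valid, non-special token ids


def make_prompt_ids(c: int, n: int, rep: int, seed: int = 0) -> list[int]:
    """Return C stable prefix ids followed by N ids unique to this repetition."""
    prefix_rng = random.Random(f"prefix-{seed}-{c}")             # same prefix for a given C
    suffix_rng = random.Random(f"suffix-{seed}-{c}-{n}-{rep}")   # fresh per (C, N, rep)
    prefix = [prefix_rng.randrange(VOCAB_SIZE) for _ in range(c)]
    suffix = [suffix_rng.randrange(VOCAB_SIZE) for _ in range(n)]
    return prefix + suffix


def make_request(c: int, n: int, o: int, rep: int) -> dict:
    """Build a request body; other kv_transfer_params fields are deployment-specific."""
    return {
        "prompt": make_prompt_ids(c, n, rep),
        "max_tokens": o,
        "kv_transfer_params": {"do_remote_prefill": True},
    }
```

Seeding the prefix by `C` keeps it identical to the prompt used when seeding D's cache, while the per-repetition suffix prevents a later repetition from hitting KV left behind by an earlier one.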

### D-Side Cache Verification

Before each measurement, verify D's cache state:
```python
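import httpx

# Fetch D's Prometheus metrics (d_host / d_port identify the D instance).
resp = httpx.get(f"http://{d_host}:{d_port}/metrics", timeout=10.0)
resp.raise_for_status()
# Metric names vary by vLLM version (vllm:prefix_cache_hit_rate, gpu_prefix_cache_hit_rate, ...).
cache_lines = [ln for ln in resp.text.splitlines() if "prefix_cache" in ln and not ln.startswith("#")]
# Hit rates cannot confirm exactly C cached tokens; diff counters around seeding or query cached block count.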
```
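
Since a hit rate is an average over all traffic, comparing counter values before and after a probe request is usually more informative. The parser below is a minimal sketch for the Prometheus text format; it assumes samples carry no trailing timestamp, and the metric name in the usage note is made up for illustration.

```python
def parse_prometheus(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition into {"name{labels}": value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        try:
            samples[series] = float(value)
        except ValueError:
            continue  # skip lines that do not end in a numeric sample
    return samples


def counter_delta(before: dict, after: dict, name_fragment: str) -> float:
    """Sum the increase of every series whose name contains name_fragment."""
    return sum(after[k] - before.get(k, 0.0) for k in after if name_fragment in k)
```

For example, scraping `/metrics` before and after a probe request and calling `counter_delta(before, after, "demo_cache_hits")` (a hypothetical name) shows how many units that single request contributed, independent of earlier traffic.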

---

## vLLM Instrumentation Patch

Minimal patch to `mooncake_connector.py` for timestamp collection:

```python
# Add at top
import time
_PROFILE_LOG = []  # or write to file

# In send_kv_to_decode(), around line 800-990:
async def send_kv_to_decode(self, ...):
    t_ready = time.perf_counter_ns()  # handler entry: P prefill done, ready to send

    # ... ZMQ receive metadata from D ...
    t_zmq_recv = time.perf_counter_ns()  # t4

    # ... build transfer params ...
    t_params_built = time.perf_counter_ns()  # t5: taken immediately before the RDMA write

    ret_value = self.engine.batch_transfer_sync_write(...)
    t_rdma_done = time.perf_counter_ns()  # t6: valid only if ret_value == 0

    # ... send ZMQ response ...
    t_zmq_sent = time.perf_counter_ns()
    
    _PROFILE_LOG.append({
        "req_id": req_id,
        "bytes": sum(lengths),
        "num_ops": len(src_ptrs),
        "t_ready": t_ready,
        "t_zmq_recv": t_zmq_recv,
        "t_params_built": t_params_built, 
        "t_rdma_done": t_rdma_done,
        "t_zmq_sent": t_zmq_sent,
    })
```
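
An in-memory list is lost if the process exits, so a small file-backed helper may be more practical for long sweeps. The class below is a generic sketch and not part of vLLM or Mooncake; the class name, method names, and JSON-lines layout are our own choices.

```python
import json
import os
import time


class ProfileLog:
    """Collect named perf_counter_ns timestamps per request and append them as JSON lines."""

    def __init__(self, path: str):
        self.path = path
        self._pending = {}

    def mark(self, req_id: str, name: str) -> int:
        """Record the current timestamp under `name` for this request."""
        ts = time.perf_counter_ns()
        self._pending.setdefault(req_id, {})[name] = ts
        return ts

    def flush(self, req_id: str, **extra) -> dict:
        """Write one line for the request and forget its pending timestamps."""
        record = {"req_id": req_id, "pid": os.getpid(),
                  **self._pending.pop(req_id, {}), **extra}
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        return record
```

Calling `mark()` inside the timed region and `flush()` only after the transfer completes keeps file I/O out of the measured intervals. Recording the `pid` makes it easier to join P-side and D-side logs afterwards.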

Similar patches needed in:
- `scheduler.py`: Log `t_schedule_start`, `t_promote`
- `process_pulling_result()`: Log `t_recv_complete`

---

## Output Format

### Per-Request Record (`results/lifecycle/C{c}_N{n}_O{o}_rep{r}.json`)

The values below are illustrative placeholders that show the schema, not measurements.

```json
{
    "config": {
        "prior_context": 32000,
        "current_new_tokens": 8192,
        "output_length": 128,
        "total_input_length": 40192
    },
    "timestamps_ns": {
        "t0_proxy_recv": 1000000000,
        "t1_proxy_dispatch": 1050000000,
        "t2_p_schedule": 1200000000,
        "t3_p_prefill_done": 2100000000,
        "t4_zmq_metadata": 2150000000,
        "t5_rdma_start": 2200000000,
        "t6_rdma_complete": 3300000000,
        "t7_d_recv_signal": 3350000000,
        "t8_d_promoted": 3500000000,
        "t9_d_first_token": 3600000000,
        "t10_d_last_token": 4800000000
    },
    "breakdown_ms": {
        "routing": 50.0,
        "p_queue": 150.0,
        "p_prefill": 900.0,
        "zmq_handshake": 100.0,
        "rdma_transfer": 1100.0,
        "transfer_signal": 50.0,
        "d_promotion": 150.0,
        "d_first_token": 100.0,
        "d_decode": 1200.0,
        "transfer_total": 1250.0,
        "e2e": 3800.0
    },
    "transfer_detail": {
        "bytes_transferred": 268435456,
        "num_rdma_ops": 512,
        "num_blocks": 512,
        "num_layers": 32,
        "build_params_ms": 0.8,
        "rdma_write_ms": 1100.0,
        "effective_bw_gbps": 1.95
    }
}
```

### Aggregated Summary (`results/lifecycle/summary.csv`)

The rows below are placeholders that show the column layout, not measured results.

```csv
prior_context,new_tokens,output_length,p_prefill_ms,rdma_transfer_ms,transfer_total_ms,d_decode_ms,e2e_ms,bytes_GB,bw_gbps,ttft_overhead_ms
0,8192,128,890,1100,1200,480,2620,0.268,1.95,1250
32000,8192,128,890,1100,1200,480,2620,0.268,1.95,1250
64000,8192,128,890,1100,1200,480,2620,0.268,1.95,1250
0,32768,128,3200,4400,4500,480,8230,1.073,1.95,4550
```
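
A possible aggregation step from per-request records to summary rows is sketched below. It assumes each record also carries a 1-based `rep` field (the sample record above encodes the repetition only in its filename) and drops the first two repetitions, following the warmup mitigation in the risk table. Using the median is our choice, not a requirement of the plan.

```python
import statistics
from collections import defaultdict


def summarize(records: list[dict], warmup_reps: int = 2) -> list[dict]:
    """Group records by (C, N, O) and take the median of each breakdown metric."""
    groups = defaultdict(list)
    for rec in records:
        if rec["rep"] <= warmup_reps:
            continue  # discard connection-warmup repetitions
        cfg = rec["config"]
        key = (cfg["prior_context"], cfg["current_new_tokens"], cfg["output_length"])
        groups[key].append(rec["breakdown_ms"])

    rows = []
    for (c, n, o), runs in sorted(groups.items()):
        row = {"prior_context": c, "new_tokens": n, "output_length": o,
               "reps_used": len(runs)}
        for metric in runs[0]:
            row[f"{metric}_ms"] = statistics.median(run[metric] for run in runs)
        rows.append(row)
    return rows
```

With only three repetitions left after warmup, the median is less sensitive to a single slow run than the mean. Keeping `reps_used` in the output makes it obvious when a configuration lost repetitions to failures.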

---

## Analysis Deliverables

### 1. Stacked Bar Chart: Lifecycle Breakdown vs N (new tokens)

- X-axis: `current_new_tokens`
- Y-axis: Time (ms)
- Stacked bars: routing | p_queue | p_prefill | zmq | rdma_transfer | signal | d_promotion | d_first_token | d_decode

Separate subplot rows for each `prior_context` value.

### 2. Transfer Bandwidth Characterization

Plot `effective_bandwidth` vs `bytes_transferred`:
- Expected: bandwidth increases with transfer size (amortizes per-op latency)
- Identify the "bandwidth knee" — minimum transfer size for near-peak bandwidth
- Compare against theoretical 200 Gbps RDMA limit
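
A minimal sketch of the bandwidth computation and one possible definition of the knee: the smallest transfer size that reaches a chosen fraction of the best observed bandwidth. The 0.9 default is an arbitrary assumption, and both function names are illustrative.

```python
def effective_bw_gbps(num_bytes: int, duration_ms: float) -> float:
    """Bytes moved over a duration in ms, expressed in gigabits per second."""
    return num_bytes * 8 / (duration_ms / 1e3) / 1e9


def bandwidth_knee(samples, peak_fraction: float = 0.9):
    """Return (size, bandwidth) for the smallest transfer reaching peak_fraction of the best.

    samples: iterable of (bytes_transferred, rdma_write_ms) pairs.
    """
    points = sorted((b, effective_bw_gbps(b, ms)) for b, ms in samples)
    if not points:
        return None
    peak = max(bw for _, bw in points)
    for size, bw in points:
        if bw >= peak_fraction * peak:
            return size, bw
    return None
```

Defining the knee relative to the observed peak rather than the 200 Gbps nominal limit keeps the result meaningful even when the link never reaches its theoretical rate.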

### 3. Transfer Cost Model

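Fit: `rdma_transfer_ms = α + β × bytes_transferred`
- α = fixed per-transfer cost (the intercept); ZMQ and scheduling costs appear here only if the fit uses `transfer_total` instead of `rdma_transfer_ms`
- β = 1/bandwidth (ms per byte); per-operation latency also folds into β when `num_rdma_ops` grows in proportion to bytes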
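
The fit is ordinary least squares on (bytes, ms) pairs. The sketch below uses the closed-form solution with only the standard library and also returns R², which the success criteria refer to. The function name is illustrative.

```python
def fit_transfer_cost(bytes_list, ms_list):
    """Least-squares fit of rdma_transfer_ms = alpha + beta * bytes_transferred.

    Returns (alpha_ms, beta_ms_per_byte, r_squared).
    """
    n = len(bytes_list)
    if n < 2 or n != len(ms_list):
        raise ValueError("need at least two (bytes, ms) pairs of equal length")
    mean_x = sum(bytes_list) / n
    mean_y = sum(ms_list) / n
    sxx = sum((x - mean_x) ** 2 for x in bytes_list)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(bytes_list, ms_list))
    if sxx == 0:
        raise ValueError("all transfer sizes are identical; slope is undefined")
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    ss_res = sum((y - (alpha + beta * x)) ** 2 for x, y in zip(bytes_list, ms_list))
    ss_tot = sum((y - mean_y) ** 2 for y in ms_list)
    r2 = 1.0 - ss_res / ss_tot if ss_tot else 1.0
    return alpha, beta, r2
```

With β in ms per byte, the implied bandwidth in Gbps is `8e-6 / beta`. For instance, β = 4e-8 ms/byte corresponds to 200 Gbps.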

### 4. Overhead vs Co-Located Baseline

For each config, also measure the same request on a **combined** (no PD-sep) instance:
- `colo_ttft` = time from request to first token on combined instance
- `pdsep_overhead = pdsep_ttft - colo_ttft`

Plot: overhead as function of (C, N) — when does PD-sep become net-negative?

### 5. Impact of Prior Context on Transfer Volume

Since transfer is incremental:
- When `C` increases (D has more cached), actual `bytes_transferred` should stay constant (≈ N × per_token_kv_size), up to one block of rounding at the cached/new boundary
- Verify this — if NOT constant, there's a bug in incremental transfer logic
- Plot actual `bytes_transferred` vs C for fixed N
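
The constancy check can be automated over the summary rows. This sketch assumes each row exposes `new_tokens` and `bytes_transferred` fields, and it treats a relative spread above 2% as a violation; both the field names and the tolerance are assumptions to adjust.

```python
from collections import defaultdict


def incremental_violations(rows, rel_tol: float = 0.02) -> dict:
    """Return {N: (min_bytes, max_bytes)} where bytes vary across C by more than rel_tol."""
    by_n = defaultdict(list)
    for row in rows:
        by_n[row["new_tokens"]].append(row["bytes_transferred"])

    violations = {}
    for n, values in by_n.items():
        lo, hi = min(values), max(values)
        if hi > 0 and (hi - lo) / hi > rel_tol:
            violations[n] = (lo, hi)
    return violations
```

An empty result supports the incremental-transfer expectation. A non-empty one lists the `N` values worth inspecting, for example to see whether bytes grow with `C`, which would suggest D's cached prefix is being re-sent.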

---

## Risks & Mitigations

| Risk | Impact | Mitigation |
|------|--------|------------|
| Clock skew between P and D processes | Wrong cross-instance durations | Run the client, proxy, P, and D on one host so all `time.perf_counter_ns()` readings come from the same monotonic clock (on Linux, `CLOCK_MONOTONIC`) |
| P and D scheduler step async | Promotion delayed by step interval | Record D's scheduler step frequency, subtract 0.5×step from d_promotion |
| Prefix cache eviction during experiment | C not actually cached | Monitor cache metrics, use small enough working set |
| Mooncake connection pool warmup | First transfer slower | Discard first 2 repetitions, use reps 3-5 |
| vLLM internal queuing at high C+N | OOM or scheduling delays | Monitor `gpu_cache_usage_perc`, keep C+N ≤ 132k |

---

## Execution Estimate

| Phase | Time |
|-------|------|
| vLLM patch development & validation | 2 hours |
| Per configuration (5 reps × ~10s each) | ~50s |
| Full sweep (144 configs × 50s) | ~2 hours |
| Cache seeding overhead (6 prior_context levels) | ~30 min |
| Analysis & plotting | 2 hours |
| **Total** | **~7 hours** |

---

## Success Criteria

1. **Breakdown is complete**: All phases sum to E2E (residual < 5%)
2. **Transfer dominates**: `rdma_transfer_ms > p_prefill_ms` for N ≥ 4k (confirms current bottleneck hypothesis)
3. **Bandwidth model fits**: Linear model R² > 0.95
4. **Incremental verified**: `bytes_transferred` independent of `prior_context` for fixed N
5. **Overhead quantified**: Clear threshold (N, C) where PD-sep overhead exceeds co-located execution