Two microbenchmarks quantifying the elastic offload decision:
1. Interference (corrected): cold prefill causes 14-214x TPOT p90
degradation on same-worker decode (D∈{1,2,4,8} × P∈{2k,8k,16k,32k}).
Earlier run had a prefix-cache bug (deterministic prompts hit cache
after rep 0); fixed with uuid+time_ns unique prompts.
2. Transfer lifecycle: PD-sep TTFT breakdown via Mooncake proxy,
measuring prefill→RDMA→decode startup overhead.
Key finding: offload wins at all P≥2048 operating points —
transfer cost is 25-50% of interference cost even with bulk Mooncake.
PD Transfer Lifecycle Breakdown Microbenchmark
Goal
Profile the complete request lifecycle under PD disaggregation, with emphasis on the P→D KV transfer stage. Produce a per-phase latency breakdown as a function of three independent variables:
breakdown(prior_context, current_new_tokens, output_length) → {
routing_ms, p_queue_ms, p_prefill_ms,
zmq_handshake_ms, rdma_transfer_ms, transfer_completion_signal_ms,
d_block_alloc_ms, d_cache_promotion_ms, d_schedule_ms, d_first_decode_ms,
d_decode_total_ms
}
Background: vLLM PD Transfer Semantics
Transfer is incremental:
- D uses its local prefix cache for prior turns (blocks with matching hashes)
- P only transfers the delta: `ext_tokens = remote_total - D_local_cache_hits`
- D combines: local prefix cache + remote-transferred blocks + locally-computed remainder
Therefore, prior_context (already cached on D) determines how much P actually transfers.
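The delta arithmetic above can be sketched as follows. This is an illustration only, not vLLM's code: the function names are hypothetical, and it assumes a 16-token block size with only full blocks transferred, leaving any partial trailing block as the locally-computed remainder.

```python
BLOCK_SIZE_TOKENS = 16  # assumption: KV block size in tokens


def tokens_to_transfer(remote_total: int, d_local_cache_hits: int) -> int:
    """ext_tokens = remote_total - D_local_cache_hits, floored at zero."""
    return max(remote_total - d_local_cache_hits, 0)


def split_transfer(remote_total: int, d_local_cache_hits: int) -> tuple[int, int]:
    """Return (full_blocks_from_P, remainder_tokens_computed_on_D).

    Assumption: only full blocks are transferred; the tail is recomputed on D.
    """
    ext_tokens = tokens_to_transfer(remote_total, d_local_cache_hits)
    return ext_tokens // BLOCK_SIZE_TOKENS, ext_tokens % BLOCK_SIZE_TOKENS


# Example: C = 32000 tokens already cached on D, N = 8192 new tokens.
ext = tokens_to_transfer(32000 + 8192, 32000)
blocks, remainder = split_transfer(32000 + 8192, 32000)
print(ext, blocks, remainder)
```

Under these assumptions the transferred block count depends only on N, which is the property the sweep over C is meant to verify.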
Hardware & Model
| Parameter | Value |
|---|---|
| GPUs | 2× H20 96GB (1 for P, 1 for D), NVLink/RDMA connected |
| Model | Qwen3-Coder-30B-A3B-Instruct |
| TP | 1 per instance |
| Transfer | Mooncake (kv_producer / kv_consumer) |
| `enable_prefix_caching` | `true` |
| `enable_chunked_prefill` | `true` |
| `max_num_batched_tokens` | `8192` |
| `gpu_memory_utilization` | `0.9` |
Independent Variables
| Variable | Symbol | Values | Meaning |
|---|---|---|---|
| Prior context (D-side cached) | `C` | 0, 4k, 16k, 32k, 64k, 100k | Tokens from prior turns, already in D's prefix cache |
| Current new tokens | `N` | 512, 2k, 4k, 8k, 16k, 32k | Tokens P must prefill and transfer (the delta) |
| Output length | `O` | 1, 32, 128, 512 | Decode tokens D generates after receiving KV |
Sweep: 6 × 6 × 4 = 144 configurations.
Total input_length per request = C + N.
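A minimal sketch of enumerating the sweep. The exact token counts behind "4k", "16k" and so on are an assumption here: C is taken in decimal thousands and N in binary multiples, matching the values in the sample record and summary CSV in this document (32000 and 8192). Adjust the lists if the intended values differ.

```python
from itertools import product

# Assumption: C in decimal thousands, N in binary multiples (see lead-in).
PRIOR_CONTEXT = [0, 4_000, 16_000, 32_000, 64_000, 100_000]
NEW_TOKENS = [512, 2_048, 4_096, 8_192, 16_384, 32_768]
OUTPUT_LENGTH = [1, 32, 128, 512]


def build_sweep() -> list[dict]:
    """Enumerate every (C, N, O) combination with its total input length."""
    return [
        {
            "prior_context": c,
            "current_new_tokens": n,
            "output_length": o,
            "total_input_length": c + n,
        }
        for c, n, o in product(PRIOR_CONTEXT, NEW_TOKENS, OUTPUT_LENGTH)
    ]


configs = build_sweep()
print(len(configs))  # 6 * 6 * 4 configurations
```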
Lifecycle Phases & Instrumentation
Phase Diagram
Time ─────────────────────────────────────────────────────────────────────►
[Routing]  [P Queue]  [P Prefill (chunked)]  [Transfer]        [D Startup]  [D Decode]
t0         t1         t2                     t3  t4  t5  t6    t7  t8       t9       t10
t0: Request arrives at proxy/router
t1: Request dispatched to P instance (P receives HTTP request)
t2: P scheduler picks up request (first prefill chunk starts)
t3: P prefill completes (last chunk done, all KV blocks ready)
t4: P sends ZMQ metadata to D (or D sends block alloc to P)
t5: First RDMA write issued
t6: Last RDMA write completes (all blocks landed on D GPU)
t7: D receives completion signal (ZMQ response parsed)
t8: D scheduler promotes request from WAITING_FOR_REMOTE_KVS → schedulable
t9: D first decode token emitted
t10: D final output token emitted
Instrumentation Points
| Timestamp | Where to instrument | Method |
|---|---|---|
| `t0` | Proxy `pick_instance()` entry | Proxy log with `time.perf_counter_ns()` |
| `t1` | Proxy forwards to P (HTTP send complete) | Proxy log |
| `t2` | P scheduler `schedule()`: request leaves `WAITING` | vLLM patch: log in scheduler |
| `t3` | P `request_finished()` or `save_kv_layer` last layer | vLLM patch: log in connector `record_send_reqs` |
| `t4` | P `send_kv_to_decode`: ZMQ metadata received by handler | Connector log: before `_build_transfer_params` |
| `t5` | P `batch_transfer_sync_write` entry | Connector log |
| `t6` | P `batch_transfer_sync_write` return | Connector log (`ret_value == 0`) |
| `t7` | D `process_pulling_result`: `finished_recving_reqs.add()` | Connector log |
| `t8` | D scheduler `_try_promote_blocked_waiting_request` success | Scheduler log |
| `t9` | D first token streamed to client | Client-side SSE timestamp |
| `t10` | D last token streamed to client | Client-side SSE timestamp |
Derived Metrics
| Metric | Formula | What it tells us |
|---|---|---|
| `routing_latency` | `t1 - t0` | Proxy overhead |
| `p_queue_time` | `t2 - t1` | P scheduling delay |
| `p_prefill_time` | `t3 - t2` | Actual prefill compute (chunked) |
| `zmq_handshake` | `t5 - t3` | ZMQ coordination overhead (P ready → RDMA starts) |
| `rdma_transfer_time` | `t6 - t5` | Pure RDMA data movement |
| `transfer_signal_latency` | `t7 - t6` | Completion detection (ZMQ response + asyncio poll) |
| `d_promotion_latency` | `t8 - t7` | Scheduler step delay until promotion |
| `d_first_token_latency` | `t9 - t8` | D compute startup (1 token forward + sampling) |
| `d_decode_time` | `t10 - t9` | Decode generation (O-1 tokens) |
| `transfer_total` | `t7 - t3` | End-to-end transfer overhead (the key number) |
| `ttft_overhead_vs_colo` | `t9 - t0 - p_prefill_time` | Extra latency vs if the same request ran on combined instance |
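The formulas in the table can be applied mechanically once the eleven timestamps are collected. Below is a small sketch. The function name, the key names `"t0"`..`"t10"` and the extra `e2e` entry (`t10 - t0`) are illustrative choices; the example timestamps are the ones from the sample record in the Output Format section.

```python
NS_PER_MS = 1_000_000

# The nine back-to-back phases that should add up to the end-to-end time.
PHASES = {
    "routing_latency": ("t0", "t1"),
    "p_queue_time": ("t1", "t2"),
    "p_prefill_time": ("t2", "t3"),
    "zmq_handshake": ("t3", "t5"),
    "rdma_transfer_time": ("t5", "t6"),
    "transfer_signal_latency": ("t6", "t7"),
    "d_promotion_latency": ("t7", "t8"),
    "d_first_token_latency": ("t8", "t9"),
    "d_decode_time": ("t9", "t10"),
}


def derive_breakdown(t: dict) -> dict:
    """Turn raw nanosecond timestamps into per-phase durations in ms."""
    def ms(start: str, end: str) -> float:
        return (t[end] - t[start]) / NS_PER_MS

    out = {name: ms(a, b) for name, (a, b) in PHASES.items()}
    out["transfer_total"] = ms("t3", "t7")
    out["ttft_overhead_vs_colo"] = ms("t0", "t9") - out["p_prefill_time"]
    out["e2e"] = ms("t0", "t10")
    return out


sample = {
    "t0": 1_000_000_000, "t1": 1_000_050_000, "t2": 1_000_200_000,
    "t3": 1_001_100_000, "t4": 1_001_150_000, "t5": 1_001_200_000,
    "t6": 1_002_300_000, "t7": 1_002_350_000, "t8": 1_002_500_000,
    "t9": 1_002_600_000, "t10": 1_003_800_000,
}
breakdown = derive_breakdown(sample)
print(breakdown)
```

Because the nine phases tile the interval from `t0` to `t10`, their sum equals `e2e` by construction; a non-zero residual in real data points to a missing or misplaced instrumentation point.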
Transfer Internal Breakdown
For the rdma_transfer_time phase, instrument further:
| Sub-phase | How to measure |
|---|---|
| `build_transfer_params` | Time `_build_transfer_params()` call |
| `rdma_write_submit` | Time from `batch_transfer_sync_write` entry to first RDMA CQ completion (if available) |
| `rdma_write_total` | Full `batch_transfer_sync_write` duration |
| `bytes_transferred` | `sum(lengths)` from transfer params |
| `num_rdma_ops` | `len(src_ptrs)` (number of RDMA write operations) |
| `effective_bandwidth` | `bytes_transferred / rdma_write_total` |
| `num_layers_transferred` | Count of unique layers in transfer |
| `num_blocks_transferred` | Count of blocks |
Expected relationships:
- `bytes_transferred = num_blocks × block_size_bytes × num_layers`
- `block_size_bytes = 16 tokens × 2 (KV) × num_kv_heads × head_dim × dtype_size`
- `rdma_transfer_time ≈ bytes_transferred / RDMA_bandwidth + per_op_latency × num_ops`
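As a worked example of these relationships, the sketch below computes the expected transfer volume and a first-order time estimate. The model shape (`num_kv_heads=4`, `head_dim=128`, 2-byte dtype, 48 layers), the 200 Gbps link and the 5 µs per-operation latency are placeholder assumptions, not values read from the model config or measured on this hardware.

```python
def block_size_bytes(num_kv_heads: int, head_dim: int, dtype_size: int,
                     block_tokens: int = 16) -> int:
    """Bytes of K and V for one block in one layer."""
    return block_tokens * 2 * num_kv_heads * head_dim * dtype_size


def bytes_transferred(num_blocks: int, num_layers: int, blk_bytes: int) -> int:
    return num_blocks * blk_bytes * num_layers


def estimate_rdma_ms(nbytes: int, bandwidth_gbps: float,
                     per_op_latency_us: float, num_ops: int) -> float:
    """bytes / bandwidth + per_op_latency * num_ops, in milliseconds."""
    bytes_per_ms = bandwidth_gbps * 1e9 / 8 / 1e3
    return nbytes / bytes_per_ms + per_op_latency_us * num_ops / 1e3


# Placeholder model shape and link parameters (assumptions, see lead-in).
blk = block_size_bytes(num_kv_heads=4, head_dim=128, dtype_size=2)
num_blocks = 8192 // 16  # N = 8192 new tokens
total = bytes_transferred(num_blocks, num_layers=48, blk_bytes=blk)
print(blk, total, estimate_rdma_ms(total, 200.0, 5.0, num_blocks))
```

Comparing the measured `bytes_transferred` against this prediction is a quick sanity check that the connector is moving the blocks it is expected to move.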
Protocol
Setup: Warm D's Prefix Cache
To control prior_context (C), we need D to have prior-turn KV in its local prefix cache:
Phase 0: Seed D's cache
1. For each config with C > 0:
- Send a request with C-token prompt directly to D (combined mode, no PD-sep)
- Let it generate 1 token → D now has C tokens in prefix cache
- Verify via /metrics that prefix cache utilization increased
2. Switch D to kv_consumer mode (or keep combined + use kv_transfer_params override)
Alternative: Use D in kv_both mode (combined + Mooncake enabled), then send PD-sep requests with kv_transfer_params that explicitly request remote prefill.
Main Experiment
For C in [0, 4k, 16k, 32k, 64k, 100k]:
    Seed D's prefix cache with C tokens (Phase 0)
    For N in [512, 2k, 4k, 8k, 16k, 32k]:
        For O in [1, 32, 128, 512]:
            Construct request:
                input = C_token_prefix + N_random_new_tokens (total = C+N)
                max_tokens = O
                kv_transfer_params = {do_remote_prefill: true, ...}
            Send request through proxy → P → D
            Collect all timestamps (t0..t10)
            Repeat 5 times
            Record breakdown
    Evict D's cache (restart or send cache-clearing requests)
D-Side Cache Verification
Before each measurement, verify D's cache state:
import httpx

# Check that D has exactly C tokens cached
resp = httpx.get(f"http://{d_host}:{d_port}/metrics")
resp.raise_for_status()
# Parse vllm:prefix_cache_hit_rate or gpu_prefix_cache_hit_rate
# Or use internal API to query cached block count
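The `/metrics` endpoint returns Prometheus text exposition format, so the parsing step could look roughly like the sketch below. The helper name is hypothetical, and the metric name in the sample text is only a placeholder taken from the comment above; the names actually exported depend on the vLLM version and should be checked against the live endpoint.

```python
from typing import Optional


def read_metric(metrics_text: str, name: str) -> Optional[float]:
    """Sum all samples of `name` across label sets; None if it is absent."""
    total = None
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        if "{" in line:
            metric, _, rest = line.partition("{")
            _, _, rest = rest.rpartition("}")  # drop the label set
        else:
            metric, _, rest = line.partition(" ")
        if metric.strip() != name:
            continue
        fields = rest.split()
        if fields:
            total = (total or 0.0) + float(fields[0])
    return total


# Placeholder payload; metric and label names are assumptions.
SAMPLE = """\
# HELP vllm:gpu_prefix_cache_hit_rate GPU prefix cache block hit rate.
# TYPE vllm:gpu_prefix_cache_hit_rate gauge
vllm:gpu_prefix_cache_hit_rate{model_name="some model"} 0.75
vllm:num_requests_running{model_name="some model"} 0.0
"""
print(read_metric(SAMPLE, "vllm:gpu_prefix_cache_hit_rate"))
```

Taking the metric before and after the seeding request, and comparing the two readings, is more robust than checking a single absolute value.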
vLLM Instrumentation Patch
Minimal patch to mooncake_connector.py for timestamp collection:
# Add at top
import time
_PROFILE_LOG = []  # or write to file

# In send_kv_to_decode(), around line 800-990:
async def send_kv_to_decode(self, ...):
    t_ready = time.perf_counter_ns()  # P prefill done, ready to send
    # ... ZMQ receive metadata from D ...
    t_zmq_recv = time.perf_counter_ns()
    # ... build transfer params ...
    t_params_built = time.perf_counter_ns()
    ret_value = self.engine.batch_transfer_sync_write(...)
    t_rdma_done = time.perf_counter_ns()
    # ... send ZMQ response ...
    t_zmq_sent = time.perf_counter_ns()
    _PROFILE_LOG.append({
        "req_id": req_id,
        "bytes": sum(lengths),
        "num_ops": len(src_ptrs),
        "t_ready": t_ready,
        "t_zmq_recv": t_zmq_recv,
        "t_params_built": t_params_built,
        "t_rdma_done": t_rdma_done,
        "t_zmq_sent": t_zmq_sent,
    })
Similar patches needed in:
- `scheduler.py`: log `t_schedule_start`, `t_promote`
- `process_pulling_result()`: log `t_recv_complete`
Output Format
Per-Request Record (results/lifecycle/C{c}_N{n}_O{o}_rep{r}.json)
{
  "config": {
    "prior_context": 32000,
    "current_new_tokens": 8192,
    "output_length": 128,
    "total_input_length": 40192
  },
  "timestamps_ns": {
    "t0_proxy_recv": 1000000000,
    "t1_proxy_dispatch": 1000050000,
    "t2_p_schedule": 1000200000,
    "t3_p_prefill_done": 1001100000,
    "t4_zmq_metadata": 1001150000,
    "t5_rdma_start": 1001200000,
    "t6_rdma_complete": 1002300000,
    "t7_d_recv_signal": 1002350000,
    "t8_d_promoted": 1002500000,
    "t9_d_first_token": 1002600000,
    "t10_d_last_token": 1003800000
  },
  "breakdown_ms": {
    "routing": 0.05,
    "p_queue": 0.15,
    "p_prefill": 0.90,
    "zmq_handshake": 0.10,
    "rdma_transfer": 1.10,
    "transfer_signal": 0.05,
    "d_promotion": 0.15,
    "d_first_token": 0.10,
    "d_decode": 1.20,
    "transfer_total": 1.25,
    "e2e": 3.80
  },
  "transfer_detail": {
    "bytes_transferred": 268435456,
    "num_rdma_ops": 512,
    "num_blocks": 512,
    "num_layers": 32,
    "build_params_ms": 0.8,
    "rdma_write_ms": 1100.0,
    "effective_bw_gbps": 195.2
  }
}
Aggregated Summary (results/lifecycle/summary.csv)
prior_context,new_tokens,output_length,p_prefill_ms,rdma_transfer_ms,transfer_total_ms,d_decode_ms,e2e_ms,bytes_GB,bw_gbps,ttft_overhead_ms
0,8192,128,890,1100,1200,480,2620,0.268,195,1200
32000,8192,128,890,1100,1200,480,2620,0.268,195,1200
64000,8192,128,890,1100,1200,480,2620,0.268,195,1200
0,32768,128,3200,4400,4500,480,8230,1.073,195,4500
Analysis Deliverables
1. Stacked Bar Chart: Lifecycle Breakdown vs N (new tokens)
X-axis: current_new_tokens
Y-axis: Time (ms)
Stacked bars: routing | p_queue | p_prefill | zmq | rdma_transfer | signal | d_promotion | d_decode
Separate subplot rows for each prior_context value.
2. Transfer Bandwidth Characterization
Plot effective_bandwidth vs bytes_transferred:
- Expected: bandwidth increases with transfer size (amortizes per-op latency)
- Identify the "bandwidth knee" — minimum transfer size for near-peak bandwidth
- Compare against theoretical 200 Gbps RDMA limit
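One way to locate the knee is sketched below, assuming it is defined as the smallest transfer size that reaches a chosen fraction of the 200 Gbps peak. The 90% threshold, the function names and the sample points are all illustrative, not measurements.

```python
from typing import Optional


def effective_bw_gbps(nbytes: int, duration_ms: float) -> float:
    """Convert bytes moved in duration_ms into gigabits per second."""
    return nbytes * 8 / (duration_ms / 1e3) / 1e9


def bandwidth_knee(samples: list[tuple[int, float]],
                   peak_gbps: float = 200.0,
                   fraction: float = 0.9) -> Optional[int]:
    """Smallest transfer size whose bandwidth reaches fraction * peak."""
    for nbytes, duration_ms in sorted(samples):
        if effective_bw_gbps(nbytes, duration_ms) >= fraction * peak_gbps:
            return nbytes
    return None


# Made-up (bytes, rdma_write_total in ms) points for illustration only.
points = [(1_000_000, 0.2), (10_000_000, 0.5), (100_000_000, 4.2)]
for nbytes, ms in points:
    print(nbytes, round(effective_bw_gbps(nbytes, ms), 1))
print("knee:", bandwidth_knee(points))
```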
3. Transfer Cost Model
Fit: rdma_transfer_ms = α + β × bytes_transferred
- α = per-operation fixed cost (ZMQ + scheduling)
- β = 1/bandwidth (bytes → time)
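The fit itself is ordinary least squares on (bytes, ms) pairs. A dependency-free sketch follows; the function name is hypothetical and the data points are synthetic, generated from α = 0.5 ms and a 200 Gbps link purely to show the shape of the computation.

```python
def fit_transfer_model(points: list[tuple[float, float]]) -> tuple[float, float, float]:
    """Least-squares fit of ms = alpha + beta * bytes. Returns (alpha, beta, R^2)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    ss_res = sum((y - (alpha + beta * x)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return alpha, beta, 1.0 - ss_res / ss_tot


# Synthetic data: 0.5 ms fixed cost, 4e-8 ms per byte (25 GB/s = 200 Gbps).
sizes = [16e6, 64e6, 256e6, 1024e6]
data = [(b, 0.5 + 4e-8 * b) for b in sizes]
alpha, beta, r2 = fit_transfer_model(data)
implied_gbps = 8 / (beta * 1e6)  # beta is ms per byte
print(alpha, beta, r2, implied_gbps)
```

With β expressed in ms per byte, the implied bandwidth in Gbps is `8 / (β × 1e6)`, which can be compared directly with the effective bandwidth measured per transfer.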
4. Overhead vs Co-Located Baseline
For each config, also measure the same request on a combined (no PD-sep) instance:
- `colo_ttft` = time from request to first token on combined instance
- `pdsep_overhead = pdsep_ttft - colo_ttft`
Plot: overhead as function of (C, N) — when does PD-sep become net-negative?
5. Impact of Prior Context on Transfer Volume
Since transfer is incremental:
- When `C` increases (D has more cached), actual `bytes_transferred` should stay constant (≈ N × per_token_kv_size)
- Verify this: if NOT constant, there's a bug in incremental transfer logic
- Plot actual `bytes_transferred` vs C for fixed N
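The constancy check can be automated over the per-request records. The sketch below groups records by N and flags any group whose transferred bytes vary by more than a relative tolerance; the 2% tolerance and the function name are assumptions, and the records shown are made up.

```python
def transfer_volume_is_incremental(records: list[dict],
                                   rel_tol: float = 0.02) -> dict:
    """Map each N to True if bytes_transferred is (nearly) constant across C."""
    by_n: dict = {}
    for rec in records:
        by_n.setdefault(rec["current_new_tokens"], []).append(rec["bytes_transferred"])
    result = {}
    for n, volumes in by_n.items():
        lo, hi = min(volumes), max(volumes)
        result[n] = hi == 0 or (hi - lo) / hi <= rel_tol
    return result


# Made-up records: N=8192 is stable across C, N=2048 grows with C (a bug).
records = [
    {"prior_context": 0, "current_new_tokens": 8192, "bytes_transferred": 268_435_456},
    {"prior_context": 32_000, "current_new_tokens": 8192, "bytes_transferred": 268_435_456},
    {"prior_context": 0, "current_new_tokens": 2048, "bytes_transferred": 67_108_864},
    {"prior_context": 32_000, "current_new_tokens": 2048, "bytes_transferred": 1_115_684_864},
]
print(transfer_volume_is_incremental(records))
```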
Risks & Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Clock skew between P and D processes | Wrong cross-instance durations | Use single-machine 2-GPU setup, share time.perf_counter_ns() clock |
| P and D scheduler step async | Promotion delayed by step interval | Record D's scheduler step frequency, subtract 0.5×step from d_promotion |
| Prefix cache eviction during experiment | C not actually cached | Monitor cache metrics, use small enough working set |
| Mooncake connection pool warmup | First transfer slower | Discard first 2 repetitions, use reps 3-5 |
| vLLM internal queuing at high C+N | OOM or scheduling delays | Monitor gpu_cache_usage_perc, keep C+N ≤ 132k |
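Two of the mitigations above are simple post-processing steps: dropping warm-up repetitions and correcting promotion latency for the scheduler step interval. A sketch under the stated rules (discard the first 2 reps, subtract 0.5 × step) follows; the function names and the choice of the median as the summary statistic are assumptions.

```python
from statistics import median


def aggregate_reps(values_ms: list[float], warmup_reps: int = 2) -> dict:
    """Summarize steady-state repetitions after discarding warm-up runs."""
    steady = values_ms[warmup_reps:]
    if not steady:
        raise ValueError("no repetitions left after discarding warm-up")
    return {
        "median_ms": median(steady),
        "min_ms": min(steady),
        "max_ms": max(steady),
        "n": len(steady),
    }


def corrected_promotion_ms(d_promotion_ms: float, step_ms: float) -> float:
    """Subtract the expected half-step wait, never going below zero."""
    return max(d_promotion_ms - 0.5 * step_ms, 0.0)


# Made-up transfer_total values for 5 reps; the first two include warm-up.
print(aggregate_reps([5.0, 3.0, 1.0, 1.2, 1.1]))
print(corrected_promotion_ms(0.15, 0.2))
```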
Execution Estimate
| Phase | Time |
|---|---|
| vLLM patch development & validation | 2 hours |
| Per configuration (5 reps × ~10s each) | ~50s |
| Full sweep (144 configs × 50s) | ~2 hours |
| Cache seeding overhead (6 prior_context levels) | ~30 min |
| Analysis & plotting | 2 hours |
| Total | ~7 hours |
Success Criteria
- Breakdown is complete: All phases sum to E2E (residual < 5%)
- Transfer dominates: `rdma_transfer_ms > p_prefill_ms` for N ≥ 4k (confirms current bottleneck hypothesis)
- Bandwidth model fits: Linear model R² > 0.95
- Incremental verified: `bytes_transferred` independent of `prior_context` for fixed N
- Overhead quantified: Clear threshold (N, C) where PD-sep overhead exceeds co-located execution
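The first three criteria can be checked per record with a few lines of code. The sketch below uses the `breakdown_ms` key names from the per-request record; the function name, the reading of "4k" as 4096 tokens, and the example numbers are assumptions for illustration.

```python
PHASE_KEYS = (
    "routing", "p_queue", "p_prefill", "zmq_handshake", "rdma_transfer",
    "transfer_signal", "d_promotion", "d_first_token", "d_decode",
)


def check_success(breakdown_ms: dict, new_tokens: int, r_squared: float) -> dict:
    """Evaluate the residual, transfer-dominates and model-fit criteria."""
    phase_sum = sum(breakdown_ms[k] for k in PHASE_KEYS)
    residual = abs(breakdown_ms["e2e"] - phase_sum) / breakdown_ms["e2e"]
    checks = {
        "breakdown_complete": residual < 0.05,
        "model_fits": r_squared > 0.95,
    }
    if new_tokens >= 4096:  # assumption: "4k" means 4096 tokens
        checks["transfer_dominates"] = (
            breakdown_ms["rdma_transfer"] > breakdown_ms["p_prefill"]
        )
    return checks


# Made-up breakdown (ms) for one N = 8192 request.
example = {
    "routing": 0.1, "p_queue": 0.2, "p_prefill": 900.0, "zmq_handshake": 0.5,
    "rdma_transfer": 1100.0, "transfer_signal": 0.3, "d_promotion": 2.0,
    "d_first_token": 30.0, "d_decode": 480.0, "e2e": 2520.0,
}
print(check_success(example, new_tokens=8192, r_squared=0.98))
```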