Gahow Wang f784e49c07 Microbench: prefill-decode interference + PD transfer lifecycle

Two microbenchmarks quantifying the elastic offload decision:

1. Interference (corrected): cold prefill causes 14-214x TPOT p90
   degradation on same-worker decode (D∈{1,2,4,8} × P∈{2k,8k,16k,32k}).
   Earlier run had a prefix-cache bug (deterministic prompts hit cache
   after rep 0); fixed with uuid+time_ns unique prompts.

2. Transfer lifecycle: PD-sep TTFT breakdown via Mooncake proxy,
   measuring prefill→RDMA→decode startup overhead.

Key finding: offload wins at all P≥2048 operating points —
transfer cost is 25-50% of interference cost even with bulk Mooncake.

2026-05-26 00:57:06 +08:00


PD Transfer Lifecycle Breakdown Microbenchmark

Goal

Profile the complete request lifecycle under PD disaggregation, with emphasis on the P→D KV transfer stage. Produce a per-phase latency breakdown as a function of three independent variables:

breakdown(prior_context, current_new_tokens, output_length) → {
    routing_ms, p_queue_ms, p_prefill_ms, 
    zmq_handshake_ms, rdma_transfer_ms, transfer_completion_signal_ms,
    d_block_alloc_ms, d_cache_promotion_ms, d_schedule_ms, d_first_decode_ms,
    d_decode_total_ms
}

Background: vLLM PD Transfer Semantics

Transfer is incremental:

D uses its local prefix cache for prior turns (blocks with matching hashes)
P only transfers the delta: ext_tokens = remote_total - D_local_cache_hits
D combines: local prefix cache + remote-transferred blocks + locally-computed remainder

Therefore, prior_context (already cached on D) determines how much P actually transfers.
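The delta rule above can be sketched as a small helper. This is an illustration only, not vLLM's implementation: the function name is hypothetical, the 16-token block size is taken from the block-size formula later in this plan, and it assumes only whole blocks are shipped while any trailing partial block is the locally computed remainder.

```python
BLOCK_SIZE = 16  # tokens per KV block (assumed, matches the formula later in this plan)

def delta_transfer(remote_total: int, d_local_cache_hits: int,
                   block_size: int = BLOCK_SIZE) -> tuple[int, int, int]:
    """Return (ext_tokens, full_blocks, local_remainder_tokens) for one request."""
    if not 0 <= d_local_cache_hits <= remote_total:
        raise ValueError("cache hits must lie in [0, remote_total]")
    # ext_tokens = remote_total - D_local_cache_hits
    ext_tokens = remote_total - d_local_cache_hits
    # Assumption: whole blocks are transferred, the partial tail is recomputed on D.
    full_blocks, remainder = divmod(ext_tokens, block_size)
    return ext_tokens, full_blocks, remainder

# C = 32000 tokens already cached on D, N = 8192 new tokens
print(delta_transfer(32000 + 8192, 32000))  # (8192, 512, 0)
```

With C fixed in D's cache, only N drives the transfer volume, which is the property the sweep is designed to confirm.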

Hardware & Model

Parameter	Value
GPUs	2× H20 96GB (1 for P, 1 for D), NVLink/RDMA connected
Model	Qwen3-Coder-30B-A3B-Instruct
TP	1 per instance
Transfer	Mooncake (`kv_producer` / `kv_consumer`)
`enable_prefix_caching`	true
`enable_chunked_prefill`	true
`max_num_batched_tokens`	8192
`gpu_memory_utilization`	0.9

Independent Variables

Variable	Symbol	Values	Meaning
Prior context (D-side cached)	`C`	0, 4k, 16k, 32k, 64k, 100k	Tokens from prior turns, already in D's prefix cache
Current new tokens	`N`	512, 2k, 4k, 8k, 16k, 32k	Tokens P must prefill and transfer (the delta)
Output length	`O`	1, 32, 128, 512	Decode tokens D generates after receiving KV

Sweep: 6 × 6 × 4 = 144 configurations.

Total input_length per request = C + N.
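A minimal sketch of enumerating the sweep follows. The concrete token counts are an assumption: C is written in decimal thousands and N in powers of two, mirroring the example record in the Output Format section (prior_context 32000, current_new_tokens 8192).

```python
from itertools import product

# Assumed expansion of the "k" shorthand (see lead-in).
C_VALUES = [0, 4_000, 16_000, 32_000, 64_000, 100_000]   # prior context on D
N_VALUES = [512, 2_048, 4_096, 8_192, 16_384, 32_768]    # new tokens P prefills
O_VALUES = [1, 32, 128, 512]                             # decode tokens

configs = [
    {
        "prior_context": c,
        "current_new_tokens": n,
        "output_length": o,
        "total_input_length": c + n,
    }
    for c, n, o in product(C_VALUES, N_VALUES, O_VALUES)
]

print(len(configs))  # 144
```

Keeping C as the outermost loop variable matches the main experiment, where D's cache is seeded once per C level.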

Lifecycle Phases & Instrumentation

Phase Diagram

Time ─────────────────────────────────────────────────────────────────────►

[Routing] [P Queue] [P Prefill (chunked)] [   Transfer   ] [D Startup] [D Decode]
t0       t1        t2                    t3  t4  t5  t6   t7    t8    t9         t10

t0: Request arrives at proxy/router
t1: Request dispatched to P instance (P receives HTTP request)
t2: P scheduler picks up request (first prefill chunk starts)
t3: P prefill completes (last chunk done, all KV blocks ready)
t4: P sends ZMQ metadata to D (or D sends block alloc to P)
t5: First RDMA write issued
t6: Last RDMA write completes (all blocks landed on D GPU)
t7: D receives completion signal (ZMQ response parsed)
t8: D scheduler promotes request from WAITING_FOR_REMOTE_KVS → schedulable
t9: D first decode token emitted
t10: D final output token emitted

Instrumentation Points

Timestamp	Where to instrument	Method
`t0`	Proxy `pick_instance()` entry	Proxy log with `time.perf_counter_ns()`
`t1`	Proxy forwards to P (HTTP send complete)	Proxy log
`t2`	P scheduler `schedule()` — request leaves WAITING	vLLM patch: log in scheduler
`t3`	P `request_finished()` or `save_kv_layer` last layer	vLLM patch: log in connector `record_send_reqs`
`t4`	P `send_kv_to_decode`: ZMQ metadata received by handler	Connector log: before `_build_transfer_params`
`t5`	P `batch_transfer_sync_write` entry	Connector log
`t6`	P `batch_transfer_sync_write` return	Connector log (ret_value == 0)
`t7`	D `process_pulling_result`: `finished_recving_reqs.add()`	Connector log
`t8`	D scheduler `_try_promote_blocked_waiting_request` success	Scheduler log
`t9`	D first token streamed to client	Client-side SSE timestamp
`t10`	D last token streamed to client	Client-side SSE timestamp

Derived Metrics

Metric	Formula	What it tells us
`routing_latency`	t1 - t0	Proxy overhead
`p_queue_time`	t2 - t1	P scheduling delay
`p_prefill_time`	t3 - t2	Actual prefill compute (chunked)
`zmq_handshake`	t5 - t3	ZMQ coordination overhead (P ready → RDMA starts)
`rdma_transfer_time`	t6 - t5	Pure RDMA data movement
`transfer_signal_latency`	t7 - t6	Completion detection (ZMQ response + asyncio poll)
`d_promotion_latency`	t8 - t7	Scheduler step delay until promotion
`d_first_token_latency`	t9 - t8	D compute startup (1 token forward + sampling)
`d_decode_time`	t10 - t9	Decode generation (O-1 tokens)
`transfer_total`	t7 - t3	End-to-end transfer overhead (the key number)
`ttft_overhead_vs_colo`	t9 - t0 - p_prefill_time	Extra latency vs if the same request ran on combined instance
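The table above translates directly into code. The sketch below assumes timestamps arrive as a dict keyed "t0" through "t10" in nanoseconds; that key scheme and the function name are illustrative, not part of the plan.

```python
def derive_metrics_ms(t: dict) -> dict:
    """Turn nanosecond timestamps keyed "t0".."t10" into the derived metrics (ms)."""
    def ms(end: str, start: str) -> float:
        return (t[end] - t[start]) / 1e6

    # Nine consecutive phases that tile the interval [t0, t10].
    m = {
        "routing_latency": ms("t1", "t0"),
        "p_queue_time": ms("t2", "t1"),
        "p_prefill_time": ms("t3", "t2"),
        "zmq_handshake": ms("t5", "t3"),
        "rdma_transfer_time": ms("t6", "t5"),
        "transfer_signal_latency": ms("t7", "t6"),
        "d_promotion_latency": ms("t8", "t7"),
        "d_first_token_latency": ms("t9", "t8"),
        "d_decode_time": ms("t10", "t9"),
    }
    e2e = ms("t10", "t0")
    residual = e2e - sum(m.values())  # should be ~0 if no timestamp is missing

    # Composite metrics that overlap the phases above.
    m["transfer_total"] = ms("t7", "t3")
    m["ttft_overhead_vs_colo"] = ms("t9", "t0") - m["p_prefill_time"]
    m["e2e"] = e2e
    m["residual_fraction"] = abs(residual) / e2e if e2e else 0.0
    return m
```

Because the nine phases are consecutive differences, `residual_fraction` doubles as a check for the "phases sum to E2E" success criterion: a non-trivial residual means a timestamp was dropped or recorded out of order.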

Transfer Internal Breakdown

For the transfer stage (t4 through t6: parameter construction plus the rdma_transfer_time phase), instrument further:

Sub-phase	How to measure
`build_transfer_params`	Time `_build_transfer_params()` call
`rdma_write_submit`	Time from `batch_transfer_sync_write` entry to first RDMA CQ completion (if available)
`rdma_write_total`	Full `batch_transfer_sync_write` duration
`bytes_transferred`	`sum(lengths)` from transfer params
`num_rdma_ops`	`len(src_ptrs)` (number of RDMA write operations)
`effective_bandwidth`	`bytes_transferred / rdma_write_total`
`num_layers_transferred`	Count of unique layers in transfer
`num_blocks_transferred`	Count of blocks

Expected relationships:

bytes_transferred = num_blocks × block_size_bytes × num_layers
block_size_bytes = 16 tokens × 2(KV) × num_kv_heads × head_dim × dtype_size
rdma_transfer_time ≈ bytes_transferred / RDMA_bandwidth + per_op_latency × num_ops
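The three relationships can be evaluated numerically as a sanity check before any measurement. In the sketch below the KV shape (2 KV heads, head_dim 128, 2-byte dtype, 32 layers), the 5 µs per-op latency, and the 200 Gbps link rate are placeholder inputs chosen for illustration, not properties of the model or hardware under test.

```python
def block_size_bytes(num_kv_heads: int, head_dim: int, dtype_size: int,
                     tokens_per_block: int = 16) -> int:
    # 16 tokens x 2 (K and V) x num_kv_heads x head_dim x dtype_size
    return tokens_per_block * 2 * num_kv_heads * head_dim * dtype_size

def bytes_transferred(num_blocks: int, num_layers: int, block_bytes: int) -> int:
    return num_blocks * block_bytes * num_layers

def predicted_transfer_s(nbytes: int, bandwidth_bytes_per_s: float,
                         per_op_latency_s: float, num_ops: int) -> float:
    return nbytes / bandwidth_bytes_per_s + per_op_latency_s * num_ops

# Placeholder shape and link parameters (assumptions, see lead-in).
blk = block_size_bytes(num_kv_heads=2, head_dim=128, dtype_size=2)
total = bytes_transferred(num_blocks=512, num_layers=32, block_bytes=blk)
est = predicted_transfer_s(total, bandwidth_bytes_per_s=200e9 / 8,
                           per_op_latency_s=5e-6, num_ops=512)
print(blk, total)  # 16384 268435456
print(f"predicted transfer: {est * 1e3:.2f} ms")
```

Comparing this prediction with the measured rdma_transfer_time shows whether a run is bandwidth-bound or dominated by per-operation latency.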

Protocol

Setup: Warm D's Prefix Cache

To control prior_context (C), we need D to have prior-turn KV in its local prefix cache:

Phase 0: Seed D's cache
  1. For each config with C > 0:
     - Send a request with C-token prompt directly to D (combined mode, no PD-sep)
     - Let it generate 1 token → D now has C tokens in prefix cache
     - Verify via /metrics that prefix cache utilization increased
  2. Switch D to kv_consumer mode (or keep combined + use kv_transfer_params override)

Alternative: Use D in kv_both mode (combined + Mooncake enabled), then send PD-sep requests with kv_transfer_params that explicitly request remote prefill.

Main Experiment

For C in [0, 4k, 16k, 32k, 64k, 100k]:
    Seed D's prefix cache with C tokens (Phase 0)
    
    For N in [512, 2k, 4k, 8k, 16k, 32k]:
        For O in [1, 32, 128, 512]:
            Construct request:
                input = C_token_prefix + N_random_new_tokens  (total = C+N)
                max_tokens = O
                kv_transfer_params = {do_remote_prefill: true, ...}
            
            Send request through proxy → P → D
            Collect all timestamps (t0..t10)
            Repeat 5 times
            
            Record breakdown
    
    Evict D's cache (restart or send cache-clearing requests)
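Request construction for one configuration might look like the sketch below. It assumes prompts are sent as token-ID lists, uses a hypothetical vocabulary size, and fills in only the `do_remote_prefill` field of `kv_transfer_params`; the remaining fields are left out here just as they are elided in the pseudocode above. The fixed-seed prefix keeps the C tokens identical across repetitions so they hit D's cache, while the uuid-seeded suffix keeps the N new tokens cold.

```python
import random
import uuid

VOCAB_SIZE = 32_000  # hypothetical; use the real tokenizer's vocabulary size

def make_prefix(c_tokens: int, seed: int = 0) -> list[int]:
    """Deterministic C-token prefix, reused for seeding and for every repetition."""
    rng = random.Random(seed)
    return [rng.randrange(VOCAB_SIZE) for _ in range(c_tokens)]

def build_request(prefix: list[int], n_new: int, max_tokens: int) -> dict:
    """Prefix plus N fresh random tokens, so the delta never hits a prefix cache."""
    rng = random.Random(uuid.uuid4().int)  # new seed per call
    new_tokens = [rng.randrange(VOCAB_SIZE) for _ in range(n_new)]
    return {
        "prompt": prefix + new_tokens,          # total length = C + N
        "max_tokens": max_tokens,               # O
        # Only the field named in the plan; other fields are intentionally omitted.
        "kv_transfer_params": {"do_remote_prefill": True},
    }

prefix = make_prefix(4_000)
req = build_request(prefix, n_new=512, max_tokens=32)
print(len(req["prompt"]))  # 4512
```

Random token IDs may not round-trip through a tokenizer, so this approach only applies when the serving endpoint accepts token IDs directly.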

D-Side Cache Verification

Before each measurement, verify D's cache state:

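import httpx

d_host, d_port = "localhost", 8000  # placeholder address of the D instance

# Check that D has exactly C tokens cached
resp = httpx.get(f"http://{d_host}:{d_port}/metrics")
resp.raise_for_status()  # fail fast if the metrics endpoint is unreachable
# Parse vllm:prefix_cache_hit_rate or gpu_prefix_cache_hit_rate
# Or use internal API to query cached block count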
vLLM Instrumentation Patch

Minimal patch to mooncake_connector.py for timestamp collection:

# Add at top
import time
_PROFILE_LOG = []  # or write to file

# In send_kv_to_decode(), around line 800-990:
async def send_kv_to_decode(self, ...):
    t_ready = time.perf_counter_ns()  # P prefill done, ready to send
    
    # ... ZMQ receive metadata from D ...
    t_zmq_recv = time.perf_counter_ns()
    
    # ... build transfer params ...
    t_params_built = time.perf_counter_ns()
    
    ret_value = self.engine.batch_transfer_sync_write(...)
    t_rdma_done = time.perf_counter_ns()
    
    # ... send ZMQ response ...
    t_zmq_sent = time.perf_counter_ns()
    
    _PROFILE_LOG.append({
        "req_id": req_id,
        "bytes": sum(lengths),
        "num_ops": len(src_ptrs),
        "t_ready": t_ready,
        "t_zmq_recv": t_zmq_recv,
        "t_params_built": t_params_built, 
        "t_rdma_done": t_rdma_done,
        "t_zmq_sent": t_zmq_sent,
    })

Similar patches needed in:

scheduler.py: Log t_schedule_start, t_promote
process_pulling_result(): Log t_recv_complete
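Each entry appended to `_PROFILE_LOG` can be reduced to sub-phase durations offline. The sketch below uses only the fields recorded by the patch above; the output key names and the choice to report bandwidth in gigabits per second are illustrative.

```python
def connector_subphases_ms(rec: dict) -> dict:
    """Convert one _PROFILE_LOG entry (ns timestamps) into per-sub-phase durations."""
    def ms(end: str, start: str) -> float:
        return (rec[end] - rec[start]) / 1e6

    rdma_s = (rec["t_rdma_done"] - rec["t_params_built"]) / 1e9
    return {
        "zmq_wait_ms": ms("t_zmq_recv", "t_ready"),            # P ready -> metadata from D
        "build_params_ms": ms("t_params_built", "t_zmq_recv"),
        "rdma_write_ms": ms("t_rdma_done", "t_params_built"),
        "zmq_response_ms": ms("t_zmq_sent", "t_rdma_done"),
        # bytes -> bits, then per second, then giga
        "effective_bw_gbps": rec["bytes"] * 8 / rdma_s / 1e9 if rdma_s > 0 else float("nan"),
    }

example = {
    "req_id": "demo", "bytes": 268435456, "num_ops": 512,
    "t_ready": 0, "t_zmq_recv": 50_000, "t_params_built": 850_000,
    "t_rdma_done": 11_850_000, "t_zmq_sent": 11_900_000,
}
print(connector_subphases_ms(example))
```

The example timings are synthetic and chosen so the arithmetic is easy to follow; they are not measurements.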

Output Format

Per-Request Record (`results/lifecycle/C{c}_N{n}_O{o}_rep{r}.json`)

{
    "config": {
        "prior_context": 32000,
        "current_new_tokens": 8192,
        "output_length": 128,
        "total_input_length": 40192
    },
    "timestamps_ns": {
        "t0_proxy_recv": 1000000000,
        "t1_proxy_dispatch": 1000050000,
        "t2_p_schedule": 1000200000,
        "t3_p_prefill_done": 1001100000,
        "t4_zmq_metadata": 1001150000,
        "t5_rdma_start": 1001200000,
        "t6_rdma_complete": 1002300000,
        "t7_d_recv_signal": 1002350000,
        "t8_d_promoted": 1002500000,
        "t9_d_first_token": 1002600000,
        "t10_d_last_token": 1003800000
    },
    "breakdown_ms": {
        "routing": 0.05,
        "p_queue": 0.15,
        "p_prefill": 0.90,
        "zmq_handshake": 0.10,
        "rdma_transfer": 1.10,
        "transfer_signal": 0.05,
        "d_promotion": 0.15,
        "d_first_token": 0.10,
        "d_decode": 1.20,
        "transfer_total": 1.25,
        "e2e": 3.80
    },
    "transfer_detail": {
        "bytes_transferred": 268435456,
        "num_rdma_ops": 512,
        "num_blocks": 512,
        "num_layers": 32,
        "build_params_ms": 0.8,
        "rdma_write_ms": 1100.0,
        "effective_bw_gbps": 195.2
    }
}

Aggregated Summary (`results/lifecycle/summary.csv`)

prior_context,new_tokens,output_length,p_prefill_ms,rdma_transfer_ms,transfer_total_ms,d_decode_ms,e2e_ms,bytes_GB,bw_gbps,ttft_overhead_ms
0,8192,128,890,1100,1200,480,2620,0.268,195,1200
32000,8192,128,890,1100,1200,480,2620,0.268,195,1200
64000,8192,128,890,1100,1200,480,2620,0.268,195,1200
0,32768,128,3200,4400,4500,480,8230,1.073,195,4500

Analysis Deliverables

1. Stacked Bar Chart: Lifecycle Breakdown vs N (new tokens)

Separate subplot rows for each prior_context value.

2. Transfer Bandwidth Characterization

Plot effective_bandwidth vs bytes_transferred:

Expected: bandwidth increases with transfer size (amortizes per-op latency)
Identify the "bandwidth knee" — minimum transfer size for near-peak bandwidth
Compare against theoretical 200 Gbps RDMA limit

3. Transfer Cost Model

Fit: rdma_transfer_ms = α + β × bytes_transferred

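α = fixed per-transfer cost in ms (the intercept). Because rdma_transfer_ms = t6 - t5 excludes the ZMQ handshake, α captures ZMQ and scheduling cost only if the fit is run on transfer_total_ms instead
β = 1/bandwidth, in ms per byte (bytes → time)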
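The fit is an ordinary least-squares line, which needs nothing beyond the standard library. The sketch below computes α, β, and R² in closed form; the function name is illustrative and the sample points are synthetic, generated from a known line to show the recovery.

```python
def fit_transfer_cost(bytes_list: list[float], ms_list: list[float]) -> tuple[float, float, float]:
    """Least-squares fit of ms = alpha + beta * bytes. Returns (alpha, beta, r_squared)."""
    n = len(bytes_list)
    if n < 2 or n != len(ms_list):
        raise ValueError("need at least two (bytes, ms) pairs of equal length")
    mean_x = sum(bytes_list) / n
    mean_y = sum(ms_list) / n
    sxx = sum((x - mean_x) ** 2 for x in bytes_list)
    if sxx == 0:
        raise ValueError("all transfer sizes are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(bytes_list, ms_list))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    ss_res = sum((y - (alpha + beta * x)) ** 2 for x, y in zip(bytes_list, ms_list))
    ss_tot = sum((y - mean_y) ** 2 for y in ms_list)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return alpha, beta, r2

# Synthetic points on the line ms = 0.2 + 4e-8 * bytes
xs = [2**26, 2**27, 2**28, 2**30]
ys = [0.2 + 4e-8 * x for x in xs]
alpha, beta, r2 = fit_transfer_cost(xs, ys)
print(f"alpha={alpha:.3f} ms, implied bandwidth={1e3 / beta / 1e9:.1f} GB/s, R^2={r2:.4f}")
```

Since β is in ms per byte, the implied bandwidth in bytes per second is 1000/β. An R² below the 0.95 target would suggest that per-op latency or contention is not captured by a single line.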

4. Overhead vs Co-Located Baseline

For each config, also measure the same request on a combined (no PD-sep) instance:

colo_ttft = time from request to first token on combined instance
pdsep_overhead = pdsep_ttft - colo_ttft

Plot: overhead as function of (C, N) — when does PD-sep become net-negative?

5. Impact of Prior Context on Transfer Volume

Since transfer is incremental:

When C increases (D has more cached), actual bytes_transferred should stay constant (≈ N × per_token_kv_size)
Verify this — if NOT constant, there's a bug in incremental transfer logic
Plot actual bytes_transferred vs C for fixed N
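The constancy check can be automated alongside the plot. In the sketch below the row keys follow the summary columns, `bytes_transferred` is assumed to be in bytes, and the 2% tolerance is an arbitrary placeholder to be tuned against block-rounding effects.

```python
from collections import defaultdict

def check_incremental(rows: list[dict], rel_tol: float = 0.02) -> dict:
    """Return {N: [(C, bytes), ...]} for every N whose transfer size varies with C."""
    by_n = defaultdict(list)
    for row in rows:
        by_n[row["new_tokens"]].append((row["prior_context"], row["bytes_transferred"]))

    violations = {}
    for n, points in by_n.items():
        sizes = [b for _, b in points]
        lo, hi = min(sizes), max(sizes)
        # Flag N if the spread across C exceeds the relative tolerance.
        if hi > 0 and (hi - lo) / hi > rel_tol:
            violations[n] = sorted(points)
    return violations
```

An empty result means bytes_transferred is independent of C within tolerance; any entry lists the (C, bytes) pairs to inspect for that N.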

Risks & Mitigations

Risk	Impact	Mitigation
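Clock skew between P and D processes	Wrong cross-instance durations	Run proxy, P, D, and client on one machine with 2 GPUs so every process reads the same monotonic clock via `time.perf_counter_ns()` (system-wide on Linux; Python does not guarantee a shared reference point across processes on other platforms)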
P and D scheduler step async	Promotion delayed by step interval	Record D's scheduler step frequency, subtract 0.5×step from d_promotion
Prefix cache eviction during experiment	C not actually cached	Monitor cache metrics, use small enough working set
Mooncake connection pool warmup	First transfer slower	Discard first 2 repetitions, use reps 3-5
vLLM internal queuing at high C+N	OOM or scheduling delays	Monitor `gpu_cache_usage_perc`, keep C+N ≤ 132k

Execution Estimate

Phase	Time
vLLM patch development & validation	2 hours
Per configuration (5 reps × ~10s each)	~50s
Full sweep (144 configs × 50s)	~2 hours
Cache seeding overhead (6 prior_context levels)	~30 min
Analysis & plotting	2 hours
Total	~7 hours

Success Criteria

Breakdown is complete: All phases sum to E2E (residual < 5%)
Transfer dominates: rdma_transfer_ms > p_prefill_ms for N ≥ 4k (confirms current bottleneck hypothesis)
Bandwidth model fits: Linear model R² > 0.95
Incremental verified: bytes_transferred independent of prior_context for fixed N
Overhead quantified: Clear threshold (N, C) where PD-sep overhead exceeds co-located execution
